Selecting a topic for a master’s thesis can be both inspiring and tricky. Our brilliant software engineer, Dimitris Kalouris, shares the challenges and rewarding results of his master’s thesis, Automated Feature Engineering on Relational Data.
Link to thesis 👉 Automated Feature Engineering on Relational Data
Master Thesis Series: Automated Feature Engineering on Relational Data
Data stored in relational databases pose a significant challenge for traditional Machine Learning algorithms, because the information is spread across multiple tables and the relationships between them. Algorithms that expect a single matrix of data as input find this information difficult, or even impossible, to capture fully. Solutions to this issue fall under the area of so-called “relational data mining,” i.e., methods that summarize database information into scalar features that can, in turn, be used for modeling (“relational feature generation”).
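To make the idea of summarizing related rows into scalar features concrete, here is a minimal sketch in plain Python. The `customers` and `orders` tables, their columns, and the helper `aggregate_child` are hypothetical names chosen for illustration; they are not taken from the thesis.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy schema: a target table and a child table linked by customer_id.
customers = [
    {"customer_id": 1, "churned": 0},
    {"customer_id": 2, "churned": 1},
    {"customer_id": 3, "churned": 0},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 1, "amount": 40.0},
    {"order_id": 12, "customer_id": 2, "amount": 5.0},
]


def aggregate_child(parent_rows, child_rows, key, column):
    """Summarize a one-to-many child column into scalar features per parent row."""
    # Group the child values by the key they share with the parent table.
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[key]].append(row[column])

    result = []
    for parent in parent_rows:
        values = groups.get(parent[key], [])
        features = dict(parent)
        features[f"count_{column}"] = len(values)
        features[f"sum_{column}"] = sum(values)
        # The mean is undefined for parents without child rows; None marks it as missing.
        features[f"mean_{column}"] = mean(values) if values else None
        result.append(features)
    return result


# One row per customer: the flat matrix a standard learning algorithm expects.
feature_matrix = aggregate_child(customers, orders, "customer_id", "amount")
```

Each customer ends up with a fixed set of numeric columns regardless of how many orders they have, which is what allows the result to be passed to an algorithm that expects a single table.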
In my work, I recast the relational feature generation problem as an AI search problem centered on the table whose attributes we want to model. In this formulation, states correspond to the result of any combination of join or aggregation actions between the tables, and the goal is to reach states that contain features we can use for modeling. This naturally generates a huge number of features, which is why I also developed a scalable feature selection algorithm that processes the data in parts. I showed that this abstraction covers all the feature types generated by previous approaches and that it captures additional information in more complex graphs with multiple edges between tables, achieving a 30% gain in AUC score in those cases at the cost of additional running time. Implementing this work was challenging but, at the same time, rewarding, since it involved researching and combining knowledge from the areas of Artificial Intelligence, Machine Learning, and Database Systems.
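The search formulation can be illustrated with a small sketch. This is not the algorithm from the thesis, only a simplified breadth-first enumeration under stated assumptions: the schema (`EDGES`), the numeric columns, the aggregation list, and the function names are all hypothetical, and the code only names candidate features rather than computing them. It does show why multiple edges between the same two tables matter: each edge yields a distinct join path and therefore distinct features.

```python
from collections import deque

# Hypothetical schema graph: (table_a, table_b, join_key).
# "matches" and "teams" are linked by two different edges.
EDGES = [
    ("matches", "teams", "home_team_id"),
    ("matches", "teams", "away_team_id"),
    ("players", "teams", "team_id"),
    ("matches", "stadiums", "stadium_id"),
]
NUMERIC_COLUMNS = {
    "matches": ["attendance"],
    "players": ["age"],
    "stadiums": ["capacity"],
    "teams": ["budget"],
}
AGGREGATIONS = ["mean", "max"]


def neighbors(table):
    """Yield (other_table, join_key) for every edge touching the given table."""
    for a, b, key in EDGES:
        if a == table:
            yield b, key
        elif b == table:
            yield a, key


def enumerate_features(target, max_depth):
    """Breadth-first search over join paths that start at the target table."""
    features = []
    # A state is (current table, path of (join_key, table) steps taken so far).
    queue = deque([(target, ())])
    while queue:
        table, path = queue.popleft()
        if path:
            prefix = ".".join(f"{key}->{tbl}" for key, tbl in path)
            for column in NUMERIC_COLUMNS.get(table, []):
                for agg in AGGREGATIONS:
                    features.append(f"{agg}({prefix}.{column})")
        if len(path) < max_depth:
            on_path = {target} | {tbl for _, tbl in path}
            for next_table, key in neighbors(table):
                # Skip tables already on this path so the search terminates.
                if next_table not in on_path:
                    queue.append((next_table, path + ((key, next_table),)))
    return features


candidates = enumerate_features("teams", max_depth=2)
```

With `teams` as the target, the two edges to `matches` produce separate candidates such as `mean(home_team_id->matches.attendance)` and `mean(away_team_id->matches.attendance)`, which a method that treats the two tables as connected by a single relationship would not distinguish. The number of candidates grows quickly with depth, which is the motivation for pairing generation with feature selection.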