Automatic data-driven generation of database schemas

Name
Joel Edenberg
Abstract
The goal of this thesis was to study the possibilities for automatically generating database schemas and to implement a proof of concept. All information systems need some means of storing information, and often a relational database is used. To store the required data, we need a suitable database schema. As it turns out, similar software applications also share at least partly similar database schemas, so it should be theoretically feasible to generate schemas, or parts of them, automatically. In the first chapters of the thesis we discuss some of the general possible approaches. We propose a novel algorithm for generating schemas. To find the most common or most probable solution, the proposed algorithm uses a probabilistic model. The algorithm is given a partial list of the table names desired in the resulting schema. These table names represent the data objects the user wants to store. In essence, the given table names tell the algorithm what data needs to be saved and leave it up to the program to compose the entire solution. We created one possible implementation of the proposed algorithm, written in Python. Our prototype relies heavily on dialogue-like interaction with the user. A graphical user interface was also built to improve the working experience and ease the tuning of the algorithm. As the user is the only one who fully knows the requirements, we left several configuration parameters for the user to fine-tune. In addition to the given table names, the user can determine how many additional tables are needed (tables that the algorithm also finds relevant and appends to the solution), whether database foreign keys should also be used to find related additional tables, how many columns each table should have, and how specific the column definitions should be to the current schema. Next we discuss alternative solutions, potential improvements and possibilities for future research.
In the last part of the thesis we experimentally assessed the performance of our algorithm and compared several variations of it. We introduced a novel similarity measure between two schemas in order to estimate the quality of the answers. Due to some specifics of the chosen knowledge base (a database containing data about example schemas), our algorithm turned out not to be better than a naive first-match search. However, we believe that in practice the probabilistic algorithm yields better results.
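To make the idea of suggesting additional tables from a knowledge base of example schemas more concrete, the following is a minimal Python sketch. It is not the algorithm from the thesis, whose probabilistic model is not specified in this abstract. It assumes a simple co-occurrence weighting, and the function name suggest_tables, the parameter extra and the sample table names are all hypothetical.

```python
from collections import Counter


def suggest_tables(knowledge_base, given, extra=2):
    """Rank tables that co-occur with the requested table names.

    knowledge_base: iterable of example schemas, each a set of table names.
    given: table names the user wants in the resulting schema.
    extra: how many additional tables to suggest.

    Assumption (not from the thesis): each example schema votes for its
    other tables with a weight equal to the share of requested tables it
    contains. Scores are normalised to sum to 1 over all candidates.
    """
    given = set(given)
    scores = Counter()
    for schema in knowledge_base:
        tables = set(schema)
        overlap = len(given & tables)
        if overlap == 0:
            continue  # this example shares nothing with the request
        weight = overlap / len(given)
        for table in tables - given:
            scores[table] += weight
    total = sum(scores.values())
    # Sort by descending score, then by name so that ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(name, score / total) for name, score in ranked[:extra]]


# Hypothetical knowledge base of four example schemas.
example_schemas = [
    {"customer", "order", "order_item", "product"},
    {"customer", "order", "invoice"},
    {"employee", "department"},
    {"customer", "order", "product", "payment"},
]

for name, probability in suggest_tables(example_schemas, ["customer", "order"]):
    print(name, round(probability, 2))
```

In this sketch the extra parameter plays the role of the user-chosen number of additional tables. Following foreign keys, choosing columns and scoring the result with a schema similarity measure are left out, because the abstract does not describe them in enough detail.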
Graduation Thesis language
Estonian
Graduation Thesis type
Master - Information Technology
Supervisor(s)
Konstantin Tretjakov
Defence year
2012
 