Developing a Scikit-Learn Module for a Novel Data Partition for Machine Learning

Name
Rain Vagel
Abstract
Machine learning is the field of using data and statistical models to make predictions. With the help of data partitioning schemes, researchers can efficiently test their models and report accuracies or error values even with limited data. Depending on the partitioning scheme, other helpful results, such as the hyper-parameters of the model, can also be returned. A new data partitioning scheme, cross-validation & cross-testing, has been proposed. However, it is not yet widely used because no open-source machine learning library currently provides a function for it. In this thesis we publish a scikit-learn compatible function on GitHub and apply it to different tasks. This new function can be used by anybody under an open-source license. Our tests showed that this new partitioning scheme might perform slightly worse on regression tasks than was previously thought. Cross-validation & cross-testing must therefore be studied further, both to understand it better and to facilitate its use.
Graduation Thesis language
English
Graduation Thesis type
Bachelor - Computer Science
Supervisor(s)
Raul Vicente Zafra, Kristjan Korjus
Defence year
2017
 