Determining Estonian Usual Residents Using Machine Learning Methods

Name
Egle Saks
Abstract
Official statistics play an important role in disseminating knowledge and facts about society, enabling informed decision-making. Among the most important information they provide is information about the population, with population size at its center. In an increasingly fast-paced world, information becomes outdated more quickly than ever before, so population statistics are expected more frequently and regularly. The European Commission is preparing a regulation that would require figures on the usually resident population to be published twice a year. In Estonia, however, the usually resident population is compiled from 18 different registers, which makes more frequent publication challenging.

The aim of this master's thesis is to investigate which data are most important for determining residency and how well machine learning models can determine the population when less data is available. The data used in this study were made available by Statistics Estonia. Principal component analysis is applied to the data, and five different machine learning models are tested. The results show that the reduced dataset performs quite similarly to the original dataset, suggesting that a smaller set of registers may be sufficient for determining residency. Among the machine learning methods tested, Random Forest and XGBoost perform best.
Graduation Thesis language
Estonian
Graduation Thesis type
Master - Data Science
Supervisor(s)
Terje Trasberg, Raivo Kolde
Defence year
2023
 