E-mail Classification via Machine Learning on the Example of the Estonian Road Authority
Name
Risto Hinno
Abstract
The aim of this thesis is to create a framework for e-mail topic detection and e-mail classification using data from the Estonian Road Authority. The theoretical part gives an overview of text mining, including topic modelling and document classification. For topic modelling, the focus is on the LDA model and on finding the optimal number of topics. For document classification, the Naïve Bayes, SVM and fastText models are introduced. Methods for improving classification accuracy are described: changing the data representation, ensemble methods and calibration. In the empirical part, the data is prepared and the aforementioned models and methods are applied. The optimal number of topics varies between methods and is subjective. Coherence makes it possible to detect the optimal number of topics semi-automatically. Sufficiently cleaned data is important for topic modelling. Topic modelling can be used to annotate data for classification. After annotation, several classification models were trained to assess their accuracy. The most accurate model was created using the stacking ensemble method. The most accurate model that used no additional method was a linear SVM. The accuracies of the 20 most accurate models differed by at most 0.02. The created framework could be used for analyzing and classifying e-mails in other institutions to automate the answering process.
Graduation Thesis language
Estonian
Graduation Thesis type
Master - Conversion Master in IT
Supervisor(s)
Kairit Sirts
Defence year
2018