Institute of Computer Science - Graduation Theses Registry

E-mail Classification via Machine Learning in Example of Estonian Road Authority

Name

Risto Hinno

Abstract

The aim of this thesis is to create a framework for e-mail topic detection and e-mail classification using data from the Estonian Road Authority. The theoretical part gives an overview of text mining, including topic modelling and document classification. In topic modelling, the focus is on the LDA model and on finding the optimal number of topics. In document classification, the Naïve Bayes, SVM and fastText models are introduced. Methods for improving classification accuracy are described: changing the data representation, ensemble methods and calibration. In the empirical part, the data is prepared and the aforementioned models and methods are applied. The optimal number of topics varies between methods and is subjective. Coherence makes it possible to detect the optimal number of topics semi-automatically. It is important to have sufficiently cleaned data for topic modelling. Topic modelling could be used to annotate data for classification. After annotation, several classification models were trained to assess their accuracy. The most accurate model was created using the stacking ensemble method. The most accurate model that did not use any additional method was a linear SVM. The 20 most accurate models differed in accuracy by up to 0.02. The created framework could be used for analyzing and classifying e-mails in other institutions to automate the answering process.
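
As an illustration of how coherence can guide the choice of the number of topics, the following is a minimal sketch and not the thesis's own implementation. It assumes the UMass coherence measure (the abstract does not say which measure was used). The documents and the candidate topic lists are made-up toy data, and the function names are hypothetical.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence of one topic.

    top_words: topic words ranked from most to least probable.
    documents: list of tokenized documents (lists of words).
    Sums log((D(w_i, w_j) + 1) / D(w_j)) over pairs where w_j is ranked
    above w_i; D counts the documents containing the given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for j, i in combinations(range(len(top_words)), 2):  # always j < i
        w_j, w_i = top_words[j], top_words[i]
        d_j = doc_freq(w_j)
        if d_j == 0:
            continue  # skip words that never occur in the corpus
        score += math.log((doc_freq(w_i, w_j) + 1) / d_j)
    return score


def mean_coherence(topics, documents):
    """Average UMass coherence over all topics of one model."""
    return sum(umass_coherence(t, documents) for t in topics) / len(topics)


# Toy corpus (invented for illustration only).
documents = [
    ["road", "repair", "pothole"],
    ["road", "repair", "closure"],
    ["license", "exam", "booking"],
    ["license", "exam", "renewal"],
]

# Hypothetical top words of topic models fitted with 2 and 3 topics.
candidates = {
    2: [["road", "repair"], ["license", "exam"]],
    3: [["road", "exam"], ["license", "repair"], ["pothole", "renewal"]],
}

scores = {k: mean_coherence(topics, documents) for k, topics in candidates.items()}
best_k = max(scores, key=scores.get)  # candidate with the highest mean coherence
print(scores, best_k)
```

In practice the candidate topics would come from fitted LDA models, and the coherence ranking would be checked against a manual reading of the topics, which is why the procedure is only semi-automatic.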

Graduation Thesis language

Estonian

Graduation Thesis type

Master - Conversion Master in IT

Supervisor(s)

Kairit Sirts

Defence year

2018
