Institute of Computer Science - Graduation Theses Registry

Completed theses (Submit your thesis) Graduation theses topics (Submit a thesis topic)

Suitability of the Spark framework for data classification

Name

Sergei Laada

Abstract

The goal of this thesis is to show the suitability of the Spark framework when dealing with different types of classification algorithms and to show how exactly to adapt algorithms from MapReduce to Spark. To fulfill the goal three algorithms were chosen: k-nearest neighbor’s algorithm, naïve Bayesian algorithm and Clara algorithm. To show the various approaches it was decided to implement those algorithms using two frameworks, Hadoop and Spark. To get the results, tests were run using the same input data and input parameters for both frameworks. During the tests varied parameters were used to show the correctness of the implementations. As a result charts and tables were generated for each algorithm separately. In addition parallel speedup charts were generated to show how well algorithm implementations can be distributed between the worker nodes. Results show that Spark handles easy algorithms, like k-nearest neighbor’s algorithm, well, but the difference with Hadoop results is not very large. Naïve Bayesian algorithm revealed the special case with easy algorithms. The results show that with very fast algorithms Spark framework use more time for data distribution and configuration than for data processing itself. Clara algorithm results have shown that Spark framework handles more difficult algorithms noticeably better.

Graduation Thesis language

English

Graduation Thesis type

Master - Computer Science

Supervisor(s)

Pelle Jakovits

Defence year

2014

PDF Extras

UT Institute of Computer Science Graduation Theses Registry

Suitability of the Spark framework for data classification