Using SQL-based Scripting Languages in Hadoop Ecosystem for Data Analytics

Name
Madis-Karli Koppel
Abstract
The goal of this thesis is to compare different SQL-based scripting languages
in the Hadoop ecosystem by implementing data analytics algorithms. The thesis compares
the efficiency of the frameworks and the ease of implementing algorithms with no previous
experience in distributed computing. To fulfill this goal, three algorithms were
implemented: Pearson’s correlation, simple linear regression and the naive Bayes classifier.
The algorithms were implemented in two SQL-based frameworks in the Hadoop
ecosystem, Spark SQL and HiveQL, and the algorithms were also implemented with Spark
MLlib. Within Spark SQL, SQLContext and HiveContext were also compared. The algorithms
were tested on a cluster with different dataset sizes and different numbers of
executors. The scaling of the Spark SQL and Spark MLlib algorithms was also measured.
The results obtained in this thesis show that for Pearson’s correlation
HiveQL is slightly faster than the other two frameworks. For linear regression,
Spark SQL and Spark MLlib have similar run times, both
about 30% faster than HiveQL. Spark SQL and Spark MLlib scaled
well for these two algorithms. For the naive Bayes classifier,
Spark SQL did not scale well but was still faster than HiveQL. The results for Spark
MLlib’s multinomial naive Bayes were inconclusive. For correlation
and regression, no difference between SQLContext and HiveContext was found.
The thesis found the SQL-based frameworks easy to use: HiveQL was the easiest,
while Spark SQL required some additional investigation into distributed computing.
Implementing algorithms from Spark MLlib was more difficult, as it
required understanding the internal workings of the algorithm as well as knowledge of
distributed computing.
Graduation Thesis language
English
Graduation Thesis type
Bachelor - Computer Science
Supervisor(s)
Pelle Jakovits
Defence year
2016
 