Large Scale Data Analysis Using Apache Pig
Name
Jürmo Mehine
Abstract
This work describes Apache Pig, a software framework designed for parallel data processing. An example data analysis problem is presented and solved using the framework. The objective of the work is to demonstrate the usefulness of Pig for large-scale data analysis.
Pig is built to work with the parallel computing framework Hadoop, which implements the MapReduce programming model. Pig acts as a layer of abstraction on top of MapReduce, presenting data as relational tables and allowing for data manipulation and queries in the Pig Latin query language.
The data analysis problem used to test Pig involved collecting news stories from online RSS feeds and identifying trends in the topics covered.
As a solution, a number of Pig scripts were created to perform the necessary tasks, and a Java application was implemented as a user interface wrapper for the Pig scripts.
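The kind of grouping and counting that such an analysis relies on can be illustrated with a minimal Python sketch. It is not taken from the thesis: the sample headlines, the function names (map_story, reduce_counts, top_words_per_day) and the choice of raw word frequency as a stand-in for a "topic" are all assumptions made for illustration. It mimics, on a single machine, the map and reduce steps that a group-and-count query in Pig Latin would typically be translated into.

```python
from collections import Counter, defaultdict

# Hypothetical sample records: (publication date, headline).
# These are illustrative only and are not data from the thesis.
STORIES = [
    ("2011-03-01", "Election results announced"),
    ("2011-03-01", "Election turnout rises"),
    ("2011-03-02", "Earthquake hits coast"),
    ("2011-03-02", "Earthquake relief begins"),
    ("2011-03-02", "Election recount requested"),
]


def map_story(date, headline):
    """Map step: emit ((date, word), 1) for every word in a headline."""
    for word in headline.lower().split():
        yield (date, word), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each (date, word) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def top_words_per_day(stories, n=1):
    """Return the n most frequent words for each publication date."""
    pairs = (
        pair
        for date, headline in stories
        for pair in map_story(date, headline)
    )
    totals = reduce_counts(pairs)

    # Regroup the (date, word) totals by date so each day can be ranked.
    by_day = defaultdict(Counter)
    for (date, word), count in totals.items():
        by_day[date][word] = count
    return {date: counts.most_common(n) for date, counts in by_day.items()}


if __name__ == "__main__":
    for day, words in sorted(top_words_per_day(STORIES).items()):
        print(day, words)
```

In a real deployment the map and reduce steps would run in parallel across a Hadoop cluster rather than in a single process, and a practical trend analysis would likely filter out common words before ranking.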
Graduation Thesis language
English
Graduation Thesis type
Master - Information Technology
Supervisor(s)
Satish Srirama, Pelle Jakovits
Defence year
2011