Institute of Computer Science - Graduation Theses Registry

Large Scale Data Analysis Using Apache Pig

Name

Jürmo Mehine

Abstract

This work describes Apache Pig, a software framework for parallel data processing, and demonstrates its usefulness for large-scale data analysis by presenting an example problem and solving it with the framework. Pig is built to work with the parallel computing framework Hadoop, which implements the MapReduce programming model. Pig acts as a layer of abstraction on top of MapReduce, presenting data as relational tables and allowing data to be manipulated and queried in the Pig Latin query language. The data analysis problem used to test Pig involved collecting news stories from online RSS feeds and identifying trends in the topics covered. To solve it, a number of Pig scripts were created to perform the necessary tasks, and a Java application was implemented as a user interface wrapper for the Pig scripts.
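
To make the MapReduce model mentioned in the abstract concrete, the following is a minimal single-process Python sketch of a map step and a reduce step that count words in news headlines per day. It is not taken from the thesis: the sample records and the function names (`map_phase`, `reduce_phase`, `top_words`) are assumptions chosen for illustration, and the thesis itself uses Pig scripts running on Hadoop. In Pig Latin, a comparable aggregation would typically be expressed declaratively with `GROUP` and `COUNT` rather than with hand-written map and reduce functions.

```python
from collections import defaultdict

# Hypothetical sample records (date, headline); not data from the thesis.
STORIES = [
    ("2011-03-01", "Election results announced"),
    ("2011-03-01", "Election turnout high"),
    ("2011-03-02", "Storm hits coast"),
]


def map_phase(stories):
    """Emit ((date, word), 1) for every word in every headline."""
    for date, headline in stories:
        for word in headline.lower().split():
            yield (date, word), 1


def reduce_phase(pairs):
    """Group the emitted pairs by key and sum their counts."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def top_words(counts, date, n=1):
    """Return the n most frequent words for one date, ties broken alphabetically."""
    rows = [(word, c) for (d, word), c in counts.items() if d == date]
    return sorted(rows, key=lambda r: (-r[1], r[0]))[:n]


counts = reduce_phase(map_phase(STORIES))
```

Calling `top_words(counts, "2011-03-01")` on the sample data returns the most frequent headline word for that day. A real trend analysis would also need stop-word filtering and a comparison across days, which this sketch leaves out.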

Graduation Thesis language

English

Graduation Thesis type

Master - Information Technology

Supervisor(s)

Satish Srirama, Pelle Jakovits

Defence year

2011
