Institute of Computer Science - Graduation Theses Registry


Large Scale Feature Extraction from Linked Web Data

Name

Madis-Karli Koppel

Abstract

Data available on the web is evolving, and the way it is represented is changing as well.
Linked data has made information on the web understandable to machines. In this thesis we develop a proof-of-concept pipeline that extracts linked data from web crawls and performs feature extraction on it. The end goal of this pipeline is to provide input to machine learning models used for credit scoring. The use case focuses on extracting linked data about products and connecting each product with the company that offers it.
The solution attempts to detect whether two products from different web sites are the same, so that a single representation can be used for both. Information about companies and products is represented as a graph on which network metrics are calculated. Network metrics from multiple web crawls are stored as time series that show how the graph changes over time. We then calculate derivatives of the values in these time series.
The pipeline is designed to handle terabytes of data and is built with scalability in mind. We use Apache Spark to process large volumes of data and to remain ready should the input data grow a hundredfold.
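
The graph and time-series steps described in the abstract can be illustrated with a small sketch. It is not the thesis implementation: it uses plain Python rather than Apache Spark, and every name in it (`normalize`, `build_graph`, `average_degree`, `differences`, the sample crawl data) is hypothetical. Product matching is reduced to comparing normalized names, average degree stands in for the unspecified network metrics, and the derivative is approximated by a finite difference between consecutive crawls.

```python
from collections import defaultdict


def normalize(name):
    """Hypothetical product key: lowercase and collapse whitespace.

    An assumption for illustration only; the abstract does not say
    how matching products are detected.
    """
    return " ".join(name.lower().split())


def build_graph(offers):
    """Build an undirected company-product graph as an adjacency dict.

    offers: iterable of (company, product_name) pairs from one crawl.
    Products with the same normalized name share a single node.
    """
    adj = defaultdict(set)
    for company, product in offers:
        c = ("company", company)
        p = ("product", normalize(product))
        adj[c].add(p)
        adj[p].add(c)
    return adj


def average_degree(adj):
    """Example network metric: mean number of neighbours per node."""
    if not adj:
        return 0.0
    return sum(len(neigh) for neigh in adj.values()) / len(adj)


def differences(series):
    """Finite-difference slope between consecutive (time, value) points."""
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(series, series[1:])
    ]


# Made-up sample data: crawl index -> offers seen in that crawl.
crawls = {
    0: [("Acme", "Blue Widget"), ("Bolt", "blue  widget")],
    1: [("Acme", "Blue Widget"), ("Bolt", "blue  widget"), ("Bolt", "Red Gadget")],
}

# One metric value per crawl, ordered by crawl index.
series = [(t, average_degree(build_graph(o))) for t, o in sorted(crawls.items())]
slopes = differences(series)
```

In this sketch the two spellings of "Blue Widget" collapse into one product node, so both companies connect to the same vertex. At the scale the abstract describes, the same steps would be expressed as distributed operations rather than in-memory dictionaries.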

Graduation Thesis language

English

Graduation Thesis type

Master - Computer Science

Supervisor(s)

Pelle Jakovits, Peep Küngas

Defence year

2018

