PaaS Cloud Service for Cost-Effective Harvesting, Processing and Linking of Unstructured Open Government Data

Name
Mailis Toompuu
Abstract
The aim of this project is to develop a cloud platform service for transforming Open Government Data into Linked Open Government Data. As input, the service receives a log file, created by a web crawler, that contains over 3,000,000 URLs pointing to open documents. For each URL it opens the document, reads its content and, using "Open source tools for Estonian natural language processing" (Estnltk), finds the names of locations, organizations and people. Using the Python library RDFlib, these names are added to a Resource Description Framework (RDF) graph, so that each name is linked to the URLs of the documents in which it appears. To archive the current state of each accessed document, the service downloads all processed documents. The service also supports monthly updates of already processed documents, so that new RDF relations are generated when a document has changed. The generated RDF files are publicly available, and the service includes a SPARQL endpoint for users (graphical user interface) and machines (web services) for cost-effective querying of linked entities from the RDF files. An important challenge for this service is performance, because the documents behind these 3+ million URLs may be large. To address this, processing is parallelized where possible, using several virtual machines and all CPUs within each virtual machine. The service is tested on Google Compute Engine.
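The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: to stay runnable with the standard library only, Estnltk is replaced by a naive stand-in that treats capitalised words as names, and RDFlib is replaced by plain N-Triples strings. The predicate and entity IRIs under example.org, and the function names extract_entities, triples_for and process_all, are assumptions made for illustration.

```python
from concurrent.futures import ProcessPoolExecutor
from urllib.parse import quote

# Hypothetical IRIs; the vocabulary actually used by the service is not
# stated in the abstract.
MENTIONS = "http://example.org/vocab#mentions"
ENTITY_BASE = "http://example.org/entity/"


def extract_entities(text):
    """Stand-in for Estnltk named-entity recognition (assumption).

    Naively treats every capitalised word as a name so that the sketch
    runs without Estnltk installed.
    """
    words = (w.strip(".,;:()\"") for w in text.split())
    return sorted({w for w in words if w[:1].isupper()})


def triples_for(item):
    """Return N-Triples lines linking a document URL to each name found."""
    url, text = item
    return [
        f"<{url}> <{MENTIONS}> <{ENTITY_BASE}{quote(name)}> ."
        for name in extract_entities(text)
    ]


def process_all(items, workers=None):
    """Spread documents over worker processes (by default one per CPU)."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return [line for lines in pool.map(triples_for, items) for line in lines]


if __name__ == "__main__":
    # Document texts are inlined here; the real service fetches them by URL.
    docs = [
        ("http://example.org/doc1", "Tallinn and Tartu signed the agreement."),
        ("http://example.org/doc2", "the budget was approved by Riigikogu."),
    ]
    for line in process_all(docs):
        print(line)
```

In this sketch each document is an independent unit of work, which is what makes it possible to split the URL list across CPUs and, in the same way, across several virtual machines.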
Graduation Thesis language
English
Graduation Thesis type
Master - Software Engineering
Supervisor(s)
Peep Küngas
Defence year
2015
 