Institute of Computer Science - Graduation Theses Registry

Automated Tagging of Datasets to Improve Data Findability on Open Government Data Portals

Name

Kevin Kliimask

Abstract

Efforts to promote Open Government Data (OGD) have gained significant traction across various levels of government since the mid-2000s. As more datasets are published on OGD portals, finding specific data becomes harder, leading to information overload. Complete and accurate documentation of datasets, including the assignment of appropriate tags, is key to improving data findability and accessibility. An analysis of the Estonian Open Data Portal revealed that, of the 1787 datasets published as of April 23, 2024, 11% had no tags at all and 26% had only one tag, which underscores the challenges of data findability and accessibility within the portal. The main goal of this thesis is to propose an automated solution for tagging datasets in order to improve data findability on OGD portals. The thesis presents a prototype application that employs Large Language Models (LLMs) such as GPT-3.5-turbo and GPT-4 to automate dataset tagging, providing tags in English and Estonian. The solution was evaluated by users, and their feedback was collected to define an agenda for future improvements to the prototype.
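The tag-coverage figures quoted in the abstract (the share of datasets with no tags and with exactly one tag) could be computed along the following lines. This is only a minimal sketch, not the thesis's actual analysis code: the record layout (a dictionary with an optional "tags" list), the function name tag_coverage and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def tag_coverage(datasets):
    """Return the share of datasets with zero tags and with exactly one tag.

    `datasets` is assumed to be a list of dicts, each with an optional
    "tags" list; this layout is hypothetical, not the portal's real schema.
    """
    total = len(datasets)
    if total == 0:
        return {"no_tags": 0.0, "one_tag": 0.0}
    # Count how many datasets have 0, 1, 2, ... tags; a missing or empty
    # "tags" field is treated as zero tags.
    counts = Counter(len(d.get("tags") or []) for d in datasets)
    return {"no_tags": counts[0] / total, "one_tag": counts[1] / total}


# Made-up sample records, for illustration only.
sample = [
    {"title": "A", "tags": []},
    {"title": "B", "tags": ["transport"]},
    {"title": "C", "tags": ["health", "statistics"]},
    {"title": "D"},
]
coverage = tag_coverage(sample)
```

Applied to a full export of portal metadata, the same function would yield the kind of percentages reported above, provided the export exposes tags in a comparable form.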

Graduation Thesis language

English

Graduation Thesis type

Bachelor - Computer Science

Supervisor(s)

Anastasija Nikiforova

Defence year

2024
