Automated Tagging of Datasets to Improve Data Findability on Open Government Data Portals

Kevin Kliimask
Efforts to promote Open Government Data (OGD) have gained significant traction across various levels of government since the mid-2000s. As more datasets are published on OGD portals, finding specific data becomes harder, leading to information overload. Complete and accurate documentation of datasets, including the association of proper tags with datasets, is key to improving data findability and accessibility. An analysis of the Estonian Open Data Portal revealed that, of the 1787 datasets published as of April 23, 2024, 11% had no associated tags, while 26% had only one tag assigned, which underscores the challenges in data findability and accessibility within the portal. The main goal of this thesis is to propose an automated solution for tagging datasets in order to improve data findability on OGD portals. The thesis presents a prototype application that employs Large Language Models (LLMs) such as GPT-3.5-turbo and GPT-4 to automate dataset tagging, providing tags in English and Estonian. The developed solution was evaluated by users, and their feedback was collected to define an agenda for future improvements to the prototype.
Graduation Thesis language
Graduation Thesis type
Bachelor - Computer Science
Anastasija Nikiforova
Defence year