Automated classification of open datasets to improve data findability on open government data portals
Organization name
Software Engineering and Information Systems
Abstract
While many open government data (OGD) portals provide a large number of open datasets that are free to use and transform into value, not all of these data are actually used. In some cases, this is because the data are difficult to find due to the low level of detail provided with them, including, but not limited to, the absence or inaccuracy of the categories and tags assigned to a particular dataset, which is part of the data publisher's task. On some OGD portals, one third of the datasets are not categorized, although the portal provides a rich list of data categories that is in line with best practices and would allow these datasets to be classified. As a result, such a dataset cannot be found when the user browses by catalog or tags; only the search bar will return it, and only if the search query matches the title or description of the dataset. This thesis is intended to propose an automated data classification mechanism which, based on a dataset and the data provided about it (title, description and, if sufficiently expressive, parameters of the dataset), will suggest categories and tags to be assigned to it. Please take into account that you will be asked to carry out at least a simplified text analysis; the use of LLMs is welcome, but preferably in addition to the above, to demonstrate its superiority (if any).
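To make the idea of a simplified text analysis concrete, the following is a minimal sketch, not a prescribed solution. It scores a dataset's title and description against short keyword descriptions of each category using TF-IDF weights and cosine similarity, and proposes the most frequent terms as candidate tags. The category list, the stopword list and the function names `suggest_categories` and `suggest_tags` are all hypothetical and chosen only for illustration; a real portal would supply its own category vocabulary.

```python
import math
import re
from collections import Counter

# Hypothetical category vocabulary; a real portal would provide its own list.
CATEGORIES = {
    "Transport": "road traffic public transport bus train vehicle parking",
    "Health": "hospital health disease patient medical vaccination",
    "Environment": "air quality water pollution waste climate emissions",
}

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "of", "and", "in", "for", "a", "an", "to", "by", "on", "per"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one TF-IDF vector (term -> weight) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed inverse document frequency, so no weight is ever zero.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_categories(title, description, top_k=1):
    """Return up to top_k category names with non-zero similarity, best first."""
    names = list(CATEGORIES)
    docs = [tokenize(CATEGORIES[n]) for n in names]
    docs.append(tokenize(title + " " + description))
    vectors = tfidf_vectors(docs)
    query = vectors[-1]
    scored = [(cosine(query, vec), name) for vec, name in zip(vectors, names)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]


def suggest_tags(title, description, top_k=3):
    """Return the top_k most frequent non-stopword terms as candidate tags."""
    counts = Counter(tokenize(title + " " + description))
    return [term for term, _ in counts.most_common(top_k)]


if __name__ == "__main__":
    title = "Bus stops in Tartu"
    description = "Locations of public transport bus stops"
    print(suggest_categories(title, description))
    print(suggest_tags(title, description))
```

A baseline of this kind could also serve as the reference point against which an LLM-based approach is compared.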
First, the author will be asked to examine the state of the art on the topic, to explore OGD portals and what their datasets look like, and to identify the scenarios in which an OGD user searches for a particular dataset. Then, a list of indicators will be defined that constitute the input for data classification (mostly in line with the above, but possibly enriched), and an appropriate solution will be developed.
Finally, the output should be tested with users to evaluate the consistency of the result, preferably comparing users' level of satisfaction with the current level.
This would contribute to the FAIRness of open data, mainly to F (findability), while indirectly affecting other features that OGD should meet in order to provide social, economic and technological benefits to individual users, SMEs and governments.
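As one possible way to quantify the consistency of the result during user testing, the tags suggested for a dataset could be compared with the tags that users consider appropriate for it. The sketch below uses Jaccard similarity (size of the intersection divided by size of the union) averaged over datasets. The choice of this metric and the function names `jaccard` and `mean_agreement` are assumptions made for illustration; the topic description does not prescribe a particular measure.

```python
def jaccard(suggested, chosen):
    """Jaccard similarity between two collections of tags (1.0 if both are empty)."""
    a, b = set(suggested), set(chosen)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_agreement(pairs):
    """Average Jaccard similarity over (suggested_tags, user_tags) pairs."""
    scores = [jaccard(suggested, chosen) for suggested, chosen in pairs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical evaluation data: (tags suggested by the tool, tags chosen by users).
    results = [
        (["bus", "transport"], ["bus", "tartu"]),
        (["air", "pollution"], ["air", "pollution"]),
    ]
    print(mean_agreement(results))
```

A satisfaction questionnaire would still be needed alongside such a score, since overlap alone does not show whether users find datasets more easily.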
Thesis defence year
2024-2025
Supervisor
Anastasija Nikiforova
Communication language(s)
English
Requirements for the applicant
Level
Bachelor's, Master's
Keywords
Application contact
Name
Anastasija Nikiforova
Phone
E-mail