Query Workload-Driven Schema Optimization For Processing Large RDF Datasets

Name
Farid Valiyev
Abstract
In the world we live in, data are not only increasing in volume but are also becoming more and more interconnected. In many areas of our daily lives, such as social media, computational biology and protein networks, telecommunications, and many others, graph data models are the most natural, easy-to-understand, and versatile abstraction for representing the world’s structured knowledge. In fact, information extracted via natural language processing and computer vision is now mostly represented as Knowledge Graphs (KGs).

KGs are an efficient means to represent, integrate, and connect data from several heterogeneous data sources. These applications have led to a surge in the popularity of KGs. However, this also brings computational challenges, because KGs are growing to massive volumes. Specifically, several applications have used the standard Resource Description Framework (RDF) graph data model to represent, share, and integrate data on the web.

Therefore, scalable management of RDF KGs has become a central problem for the Semantic Web (SW) community. The native graph databases (e.g., Apache Jena, RDF-3X, and Virtuoso) fall short in managing and processing large RDF datasets because of their centralized computational paradigm, i.e., they cannot scale out. Thus, the SW community has started to investigate relational Big Data (BD) frameworks, harnessing their scalability and efficiency. Relational systems owe much of their performance to sophisticated optimizers that leverage the simplicity and maturity of the relational model and relational algebra. Despite these advantages of relational solutions, the flexible (i.e., schema-less) structure of RDF graphs makes it challenging to store and manage them in relational schemas. The state of the art shows that there is no “One-Size-Fits-All” RDF relational schema that fits all query workloads. In particular, a different RDF relational schema wins by a large margin for each query type, and the winner in one query family may unexpectedly perform the worst in another.

In this thesis, we argue that combining multiple RDF relational schemas into a hybrid one provides better performance for BD systems when querying large KGs. Nevertheless, designing hybrid schema solutions for schema-less KGs requires huge data engineering effort and tailored solutions. To this end, this thesis proposes algorithms that automatically design a hybrid RDF relational schema that adapts to the query workload, covering a wide range of query types without ignoring loading times or storage overheads. In particular, we approach this goal via data profiling along with query profiling, seeking better data localization by placing relevant data that are frequently queried together in the same relations. Our approach arrives at an optimal hybrid schema that considers both the underlying data relationships and the query workloads.
Graduation Thesis language
English
Graduation Thesis type
Master - Computer Science
Supervisor(s)
Mohamed Ragab Moawad Mohamed, Riccardo Tommasini, Alexander Nolte
Defence year
2023
 