Design and Implementation of an Incremental ELT Pipeline for a Jira Data Warehouse using Data Vault 2.0 Methodology and HP Vertica

Name
Rasmus Bobkov
Abstract
This master's thesis outlines the design and implementation of a containerized ELT pipeline for TEHIK, a company requiring an efficient way to analyze Jira Software data. The pipeline incrementally loads data into a Vertica data warehouse built according to Data Vault 2.0 principles, and its containerized architecture enables easy deployment in production environments. Given the breadth of the subject, the thesis aims to give an overview of data engineering, Data Vault 2.0, Agile methodologies, and implementation rather than delving into the intricate specifics of each area.

The thesis begins by examining the current system and its limitations, then introduces the proposed solution and its advantages. The Background Knowledge and Related Work section explains the central concepts of data engineering, data warehousing, and the Data Vault methodology, along with deployment in production environments. It covers key topics such as ingestion, ELT versus ETL architecture, data warehouse architectures, and the essence and benefits of the Data Vault 2.0 methodology.

Although the practical application of Kubernetes, logging, monitoring, and orchestration with Airflow is not included in the thesis due to time constraints, these aspects remain crucial for a holistic understanding of the project. Hence, a conceptual overview of orchestration using Airflow and a theoretical implementation for logging and monitoring are provided.

The Implementation section describes the project's process: the specific steps and methodologies employed, the challenges faced, and their respective solutions. The subsequent Results and Analysis section critically compares the proposed solution with the existing one. It evaluates reporting capabilities, compliance with SLAs, and the pipeline's performance, including its scalability and its ability to handle large data volumes.

In conclusion, this thesis delivers a robust, scalable, and efficient solution comprising an ELT pipeline and a Data Vault 2.0-based data warehouse tailored for TEHIK's Jira Software data analysis needs. This integrated solution outperforms the existing system, providing a solid foundation for future enhancements and expansions.
Graduation Thesis language
English
Graduation Thesis type
Master - Data Science
Supervisor(s)
Feras M. Awaysheh, PhD
Defence year
2023
 