Google Dataflow Orchestration Using TOSCA in the Hybrid Cloud

Name
Manish Gupta
Abstract
In today’s world, data is as precious as oil. Many organizations depend on data to make critical business decisions, target specific customers, and accelerate their business growth. This importance has driven up the volume of data created and consumed. Processing and moving data at this scale requires a practical, automated approach to data handling. A data pipeline is a series of interconnected modular tasks that collect, process, and make data available to a wide array of systems with minimal manual intervention. Numerous vendors and open-source platforms support building data pipelines for an organization. However, developers need platform-specific knowledge to manage and orchestrate different data pipeline platforms. The lack of standardization for orchestrating data pipelines leads to increased development time and reduced reusability. TOSCA is an open standard for defining the topology and orchestration of cloud services. In this paper, reusable TOSCA components were created in the RADON ecosystem to deploy, terminate, and manage Google Dataflow jobs. RADON is a research project that aims to develop a model-driven DevOps framework for serverless computing. The TOSCA components for Google Dataflow were designed to integrate with existing TOSCA components for Apache NiFi-based data pipelines. The integration provides a one-stop solution for developers to build extensive data pipelines combining Google Dataflow and Apache NiFi.
Graduation Thesis language
English
Graduation Thesis type
Master - Software Engineering
Supervisor(s)
Chinmaya Dehury, Pelle Jakovits
Defence year
2022
 