Hard and Soft Tuning of Spark Ecosystem Toward Query Energy Efficiency

Name
Tofig Bakhshiyev
Abstract
This thesis examines the energy efficiency of executing TPC-H queries in Apache Spark, focusing on different file formats (Parquet, CSV, Avro, and TBL) and partition sizes in a standalone configuration. Energy consumption is measured during both the data reading and query processing phases. The thesis first compares the characteristics of the Parquet, CSV, and Avro formats and analyses their impact on Spark’s query performance. It then investigates Spark’s standalone configuration, examining the cluster settings, resource allocation, and hardware optimisations that influence energy usage during query execution. A central part of the study is understanding how partition size affects I/O operations, data shuffling, and overall energy consumption during query processing. Using TPC-H queries as benchmarks, experiments are conducted across various file formats, partition sizes, and configurations. The outcomes offer practical insights for improving energy efficiency in Spark-based big data processing. This research contributes to the broader discourse on sustainable data processing and helps practitioners make energy-conscious decisions in Apache Spark environments.
Graduation Thesis language
English
Graduation Thesis type
Master - Computer Science
Supervisor(s)
Simon Pierre Dembele
Defence year
2024
 