Towards Automated Machine Learning: Hyperparameter Optimization in Online Clustering

Name
Dmitri Rozgonjuk
Abstract
Machine Learning (ML) has demonstrated significant potential in data-driven applications, particularly in real-time use cases through online ML, which processes data streams and handles concept drift (changes in data distribution) dynamically. Automated ML (AutoML) seeks to automate ML pipeline tasks such as hyperparameter optimization (HPO) and model selection to improve performance. Although some efforts have been made to integrate online ML and AutoML, research on automated online clustering remains limited. This thesis develops a candidate HPO solution for online clustering. The aim was to propose an ensemble-based approach that leverages more than one internal clustering validation index (CVI) to address the evaluation problem in online clustering. HPO was implemented on top of the river framework. To evaluate HPO in online clustering, two online clustering algorithms were applied to six synthetic datasets with ground truth labels. In HPO, models were optimized separately towards two internal CVIs, the Silhouette score and the Calinski-Harabasz Index, and the resulting models were compared using an external CVI, the Adjusted Rand Index. The experiments compared (a) online clustering algorithms with default parameters, (b) the best optimized online clustering algorithms, and (c) the ensemble of the best optimized models. The findings revealed that the efficacy of HPO varies with the data type: on k-centroid-based datasets, the Silhouette-optimized model and the ensemble model outperformed the other clustering solutions, whereas on S-curve datasets HPO and ensembling did not yield superior results.
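The external index named in the abstract, the Adjusted Rand Index, can be illustrated with a minimal sketch. It uses the standard pair-counting formula in plain Python and is not the thesis implementation; the function name `adjusted_rand_index` and the toy labelings are assumptions chosen for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same points.

    Illustrative sketch of the standard pair-counting formula;
    not taken from the thesis code.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)

    # Contingency counts: points sharing each (true, predicted) label pair,
    # plus the marginal cluster sizes of each labeling.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of point pairs placed together in both, or in each, labeling.
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    total_pairs = comb(n, 2)

    if total_pairs == 0:
        return 1.0  # fewer than two points: labelings trivially agree

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Hypothetical ground truth and two candidate clusterings.
    truth = [0, 0, 0, 1, 1, 1]
    candidate_a = [1, 1, 1, 0, 0, 0]  # same partition, labels swapped
    candidate_b = [0, 1, 0, 1, 0, 1]  # partition unrelated to the truth
    print(adjusted_rand_index(truth, candidate_a))
    print(adjusted_rand_index(truth, candidate_b))
```

Because the index depends only on which points are grouped together, it is unaffected by how cluster labels are numbered, which makes it suitable for comparing models optimized towards different internal CVIs against the same ground truth.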
Graduation Thesis language
English
Graduation Thesis type
Master - Data Science
Supervisor(s)
Radwa El Shawi
Defence year
2023
 