Optimal Cluster Determination in K-Means Using Gap Statistic Analysis Across Diverse Datasets

Iliyas Karim Khan; Hanita Binti Daud; Nooraini Binti Zainuddin; Rajalingam Sokkalingam; Mudasar Zafar; Soofia Iftikhar; Agha Inayat; Atta Ullah; Abdul Museeb

doi:10.28924/ada/stat.5.8


Published: May 5, 2025

Vol. 5 (2025), 8

DOI: https://doi.org/10.28924/ada/stat.5.8

Iliyas Karim Khan

Fundamental and Applied Science Department, Universiti Teknologi PETRONAS, Perak 32610, Malaysia

Hanita Binti Daud

Fundamental and Applied Science Department, Universiti Teknologi PETRONAS, Perak 32610, Malaysia

Nooraini Binti Zainuddin

Fundamental and Applied Science Department, Universiti Teknologi PETRONAS, Perak 32610, Malaysia

Rajalingam Sokkalingam

Fundamental and Applied Science Department, Universiti Teknologi PETRONAS, Perak 32610, Malaysia

Mudasar Zafar

School of Mathematics, Actuarial and Quantitative Studies (SOMAQS), Asia Pacific University of Technology & Innovation (APU), 57000 Bukit Jalil, Malaysia

Soofia Iftikhar

Department of Statistics, Shaheed Benazir Bhutto Women University, LARAMA, Charsadda Road, Peshawar, Pakistan

Agha Inayat

University of Malakand Chakdara Peshawar, Khyber Pakhtunkhwa, Pakistan

Atta Ullah

Fundamental and Applied Science Department, Universiti Teknologi PETRONAS, Perak 32610, Malaysia

Abdul Museeb

Fundamental and Applied Science Department, Universiti Teknologi PETRONAS, Perak 32610, Malaysia

Abstract

Clustering is a fundamental technique in unsupervised machine learning, and selecting the optimal number of clusters (ONC) remains a critical challenge, particularly for datasets with diverse characteristics. The Gap Statistic is a widely adopted method for determining the ONC in K-means clustering, yet its accuracy and computational cost depend on dataset size and feature complexity. This study systematically evaluates the accuracy, execution time, and coefficient of determination (R²) of the Gap Statistic on four datasets sourced from GitHub: Well Log, Time Series, Iris, and Hitters. These datasets vary in size, structure, and domain, which allows the method’s robustness to be assessed under different conditions. The Iris dataset yielded the highest accuracy (87.25%) with an R² of 0.95, indicating that the Gap Statistic performs best on well-structured data. The Time Series dataset ranked second, with 68.47% accuracy and R² = 0.88, reflecting moderate reliability. The Well Log dataset attained only 57.98% accuracy (R² = 0.66), and the Hitters dataset performed worst, with 48.41% accuracy and R² = 0.53, indicating poor clustering effectiveness. Notably, the two datasets with the highest ONC values (8 clusters each in Well Log and Hitters) also showed prolonged execution times (2.45 s and 2.41 s, respectively), pointing to higher computational cost at larger cluster counts.
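For readers unfamiliar with the method, the following is a minimal sketch of how a Gap Statistic can be computed for K-means. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which variant, reference distribution, or software was used. The sketch assumes the commonly described formulation in which Gap(k) is the mean of log W_k over uniform reference datasets minus log W_k for the observed data, with W_k the within-cluster sum of squares, and it selects the smallest k satisfying Gap(k) >= Gap(k+1) - s(k+1). The function names (init_centers, within_ss, gap_statistic), the farthest-point initialisation, and the synthetic two-blob data are all hypothetical choices made for this example.

```python
import math
import random


def init_centers(points, k):
    """Farthest-point initialisation (an assumption of this sketch, not from the paper)."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point whose nearest existing center is farthest away.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def within_ss(points, k, n_iter=25):
    """Run Lloyd's K-means and return the within-cluster sum of squares W_k."""
    centers = init_centers(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old center if its cluster is empty.
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)


def gap_statistic(points, k_max=4, n_refs=10, seed=0):
    """Return (selected k, list of Gap(k) for k = 1..k_max)."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(within_ss(points, k))
        ref_logs = []
        for _ in range(n_refs):
            # Reference data: uniform over the bounding box of the observed data.
            ref = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                   for _ in points]
            ref_logs.append(math.log(within_ss(ref, k)))
        mean_ref = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_refs)
        gaps.append(mean_ref - log_w)
        errs.append(sd * math.sqrt(1 + 1 / n_refs))
    # Smallest k with Gap(k) >= Gap(k+1) - s(k+1); fall back to k_max otherwise.
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return k_max, gaps


# Synthetic demo data (hypothetical): two well-separated 2-D Gaussian blobs.
data_rng = random.Random(42)
data = [(data_rng.gauss(cx, 0.5), data_rng.gauss(0.0, 0.5))
        for cx in (0.0, 10.0) for _ in range(40)]

best_k, gaps = gap_statistic(data, k_max=4)
print("selected k:", best_k)
print("gap values:", [round(g, 3) for g in gaps])
```

On data this cleanly separated, the selection rule should settle on two clusters. Each candidate k requires fitting K-means on the observed data and on every reference dataset, so the cost grows with both the number of reference draws and the range of k examined, which is one plausible reason execution time varies across datasets.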

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
