Optimal Cluster Determination in K-Means Using Gap Statistic Analysis Across Diverse Datasets


Iliyas Karim Khan
Hanita Binti Daud
Nooraini Binti Zainuddin
Rajalingam Sokkalingam
Mudasar Zafar
Soofia Iftikhar
Agha Inayat
Atta Ullah
Abdul Museeb

Abstract

Clustering is a fundamental technique in unsupervised machine learning, and selecting the optimal number of clusters (ONC) remains a critical challenge, particularly for datasets with diverse characteristics. The Gap Statistic is a widely adopted method for determining the ONC in K-means clustering, yet its accuracy and computational efficiency are influenced by dataset size and feature complexity. This study systematically evaluates the accuracy, execution time, and coefficient of determination (R²) of the Gap Statistic across four distinct datasets sourced from GitHub: Well Log, Time Series, Iris, and Hitters. These datasets vary in size, structure, and domain, providing a broad assessment of the method’s robustness. The Iris dataset yielded the highest accuracy (87.25%) with an R² of 0.95, indicating that the Gap Statistic performs best on well-structured datasets. The Time Series dataset ranked second, with 68.47% accuracy and R² = 0.88, reflecting moderate reliability. By contrast, the Well Log dataset attained only 57.98% accuracy (R² = 0.66), and the Hitters dataset performed worst, with 48.41% accuracy and R² = 0.53, indicating poor clustering effectiveness. Notably, the datasets with higher ONC values (8 clusters each for Well Log and Hitters) also showed longer execution times (2.45 s and 2.41 s, respectively), suggesting that computational cost grows with the number of clusters.
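
For readers unfamiliar with the method, the following is a minimal sketch of how the Gap Statistic can select the number of clusters for K-means. It is not the authors' implementation. It assumes the standard formulation (uniform reference data drawn over the bounding box of the observations, and the "smallest k such that Gap(k) >= Gap(k+1) - s(k+1)" selection rule), uses a plain Lloyd's-algorithm K-means written with the Python standard library, and runs on synthetic blobs rather than the four datasets studied here. The function names (`kmeans`, `log_wk`, `gap_statistic`) and all parameter values are illustrative assumptions.

```python
import math
import random

def kmeans(points, k, rng, iters=30):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new == labels:  # assignments stopped changing
            break
        labels = new
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def log_wk(points, k, rng, restarts=6):
    """log of the within-cluster sum of squares, best of several restarts."""
    best = math.inf
    for _ in range(restarts):
        labels = kmeans(points, k, rng)
        w = 0.0
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                center = tuple(sum(c) / len(members) for c in zip(*members))
                w += sum(math.dist(p, center) ** 2 for p in members)
        best = min(best, w)
    return math.log(best)

def gap_statistic(points, k_max=5, n_refs=10, seed=0):
    """Return (selected k, list of Gap(k) for k = 1..k_max)."""
    rng = random.Random(seed)
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    gaps, s = [], []
    for k in range(1, k_max + 1):
        ref_logs = []
        for _ in range(n_refs):
            # Reference data: uniform over the bounding box of the observations.
            ref = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                   for _ in points]
            ref_logs.append(log_wk(ref, k, rng))
        mean = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((x - mean) ** 2 for x in ref_logs) / n_refs)
        gaps.append(mean - log_wk(points, k, rng))  # Gap(k)
        s.append(sd * math.sqrt(1 + 1 / n_refs))    # s(k)
    # Selection rule: smallest k with Gap(k) >= Gap(k+1) - s(k+1).
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k, gaps
    return k_max, gaps

# Synthetic example (not one of the paper's datasets): three separated 2-D blobs.
data_rng = random.Random(1)
blob_centers = [(0.0, 0.0), (4.0, 0.0), (20.0, 0.0)]
points = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
          for cx, cy in blob_centers for _ in range(30)]
best_k, gaps = gap_statistic(points)
print("selected k:", best_k)
```

Every candidate k requires K-means to be refit on each reference dataset as well as on the observed data, so the cost of the procedure rises with the range of k explored. This is consistent with the longer execution times reported above for the datasets with more clusters.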
