Conference Proceedings

Seeding on samples for accelerating k-means clustering

JS Low, Z Ghafoori, JC Bezdek, C Leckie

Proceedings of the 3rd International Conference on Big Data and Internet of Things - BDIOT 2019 | ACM | Published : 2019

Abstract

K-means clustering with random seeds can result in arbitrarily poor clusters. Much work has been done to improve initial centroid selection, also known as seeding; however, better seeding algorithms do not scale to large or unloadable datasets. In this paper, we first show that running the D2 seeding used in k-means++ on a random sample and then clustering the whole dataset results in faster runtime and comparable accuracy relative to the original algorithm. We then propose a new method that performs both the D2 seeding and the clustering on the random sample. This method essentially runs k-means++ on the sample, then extends cluster assignments to every other point using nearest centroid classification. ..
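
The second method described above can be sketched roughly as follows. This is a minimal pure-Python illustration based only on the abstract, not the authors' implementation: the function names (`d2_seed`, `lloyd`, `sample_kmeanspp`), the `sample_size` and `seed` parameters, the uniform sampling without replacement, and the stopping rule are all assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))


def d2_seed(points, k, rng):
    """D2 seeding as used in k-means++: each new seed is drawn with
    probability proportional to its squared distance to the nearest
    seed chosen so far."""
    seeds = [rng.choice(points)]
    d2 = [sq_dist(p, seeds[0]) for p in points]
    while len(seeds) < k:
        if sum(d2) == 0:
            # Every point coincides with a seed; fall back to a uniform pick.
            seeds.append(rng.choice(points))
        else:
            seeds.append(rng.choices(points, weights=d2, k=1)[0])
        d2 = [min(d, sq_dist(p, seeds[-1])) for d, p in zip(d2, points)]
    return seeds


def lloyd(points, centroids, max_iter=100):
    """Standard k-means (Lloyd) iterations starting from given centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An empty cluster keeps its previous centroid (an assumption).
        new = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if new == centroids:
            break
        centroids = new
    return centroids


def sample_kmeanspp(data, k, sample_size, seed=0):
    """Seed and cluster on a random sample, then label every point in
    the full dataset by its nearest centroid."""
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    centroids = lloyd(sample, d2_seed(sample, k, rng))
    labels = [nearest(p, centroids) for p in data]
    return centroids, labels


if __name__ == "__main__":
    data = [(0.0, 0.0)] * 50 + [(10.0, 10.0)] * 50
    centroids, labels = sample_kmeanspp(data, k=2, sample_size=60)
    print(centroids, labels[0], labels[-1])
```

Under the same assumptions, the first approach in the abstract (seeding on the sample, then clustering the whole dataset) would correspond to calling `lloyd(data, d2_seed(sample, k, rng))` instead of running `lloyd` on the sample.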
