Flash-KMeans: Fast and Memory-Efficient Exact K-Means

Abstract

k-means has historically been positioned primarily as an offline processing primitive, typically used for dataset organization or embedding preprocessing rather than as a first-class component in online systems. In this work, we revisit this classical algorithm under the lens of modern AI system design and enable k-means as an online primitive. We point out that existing GPU implementations of k-means remain fundamentally bottlenecked by low-level system constraints rather than theoretical algorithmic complexity. Specifically, the assignment stage suffers from a severe IO bottleneck due to the massive explicit materialization of the N × K distance matrix in High Bandwidth Memory (HBM). Simultaneously, the centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations. To bridge this performance gap, we propose flash-kmeans, an IO-aware and contention-free k-means implementation for modern GPU workloads. Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization; (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions. Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability. Extensive evaluations on NVIDIA H200 GPUs demonstrate that flash-kmeans achieves up to 17.9× end-to-end speedup over the best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33× and over 200×, respectively.
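
The two kernel-level ideas named above can be illustrated with a small CPU sketch in NumPy. This is not the authors' GPU implementation; it only mirrors the data flow under stated assumptions. The function names `assign_tiled` and `update_sorted` and the parameter `tile_k` are hypothetical. The first function keeps a running minimum per point while visiting centroids tile by tile, so the full N × K distance matrix is never stored (only one N × tile_k tile exists at a time, which a fused GPU kernel would presumably hold on-chip). The second function sorts points by cluster id and reduces each contiguous segment, in place of a scatter-add into shared accumulators.

```python
import numpy as np


def assign_tiled(x, c, tile_k=64):
    """Nearest-centroid assignment without building the full N x K matrix.

    Centroids are visited in tiles of `tile_k`; per point, only the running
    minimum squared distance and its centroid index are kept (online argmin).
    """
    n = x.shape[0]
    best_dist = np.full(n, np.inf)
    best_idx = np.zeros(n, dtype=np.int64)
    x_sq = (x * x).sum(axis=1)
    rows = np.arange(n)
    for start in range(0, c.shape[0], tile_k):
        c_tile = c[start:start + tile_k]
        c_sq = (c_tile * c_tile).sum(axis=1)
        # Squared distances to this tile only: shape (N, <= tile_k).
        d = x_sq[:, None] - 2.0 * (x @ c_tile.T) + c_sq[None, :]
        local_idx = d.argmin(axis=1)
        local_min = d[rows, local_idx]
        # Strict comparison keeps the earliest centroid on exact ties.
        better = local_min < best_dist
        best_dist[better] = local_min[better]
        best_idx[better] = start + local_idx[better]
    return best_idx


def update_sorted(x, assign, k):
    """Per-cluster sums and counts via sorting and contiguous segment sums."""
    # `order` maps each sorted position back to the original point index.
    order = np.argsort(assign, kind="stable")
    sorted_assign = assign[order]
    x_sorted = x[order]
    # bounds[j] is the first sorted position whose cluster id is >= j.
    bounds = np.searchsorted(sorted_assign, np.arange(k + 1))
    counts = np.diff(bounds)
    sums = np.zeros((k, x.shape[1]), dtype=x.dtype)
    for j in range(k):
        if counts[j] > 0:
            # Each cluster reads one contiguous slice; no scattered writes.
            sums[j] = x_sorted[bounds[j]:bounds[j + 1]].sum(axis=0)
    return sums, counts
```

New centroids would then be `sums / counts[:, None]` for clusters with a nonzero count; how empty clusters are handled is left open here because the abstract does not specify it.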
