Flash-KMeans: Fast and Memory-Efficient Exact K-Means

Abstract

k-means has historically been positioned primarily as an offline processing primitive, typically used for dataset organization or embedding preprocessing rather than as a first-class component in online systems. In this work, we revisit this classical algorithm through the lens of modern AI system design and enable k-means as an online primitive. We point out that existing GPU implementations of k-means remain fundamentally bottlenecked by low-level system constraints rather than by theoretical algorithmic complexity. Specifically, the assignment stage suffers from a severe IO bottleneck due to the massive explicit materialization of the N × K distance matrix in High Bandwidth Memory (HBM). Simultaneously, the centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations. To bridge this performance gap, we propose flash-kmeans, an IO-aware and contention-free k-means implementation for modern GPU workloads. Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization; (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions. Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability. Extensive evaluations on NVIDIA H200 GPUs demonstrate that flash-kmeans achieves up to 17.9× end-to-end speedup over the best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33× and over 200×, respectively.
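The two kernel-level ideas named in the abstract can be illustrated with a minimal CPU sketch in plain Python. This is not the authors' GPU implementation: the function names `assign_online_argmin` and `update_sorted_segments`, the `tile` parameter, and the choice to leave an empty cluster's centroid unchanged are all assumptions made for illustration. The first function keeps only a running (best distance, best index) pair per point, so no N × K distance matrix is ever stored. The second sorts point indices by assigned cluster so that each cluster's members form one contiguous segment, which is then reduced locally instead of being scattered into shared accumulators.

```python
from typing import List

Point = List[float]


def assign_online_argmin(points: List[Point], centroids: List[Point], tile: int = 2) -> List[int]:
    """Assign each point to its nearest centroid without storing all distances.

    Centroids are scanned tile by tile (hypothetical tile size); only the
    running minimum distance and its index are kept per point.
    """
    labels = []
    for p in points:
        best_d, best_k = float("inf"), -1
        for start in range(0, len(centroids), tile):
            for k in range(start, min(start + tile, len(centroids))):
                d = sum((a - b) ** 2 for a, b in zip(p, centroids[k]))
                if d < best_d:  # online argmin: update the running best
                    best_d, best_k = d, k
        labels.append(best_k)
    return labels


def update_sorted_segments(points: List[Point], labels: List[int], centroids: List[Point]) -> List[Point]:
    """Recompute centroids by reducing contiguous segments of sorted indices.

    `order` maps sorted position -> original point index, so all members of a
    cluster are adjacent. Assumption: an empty cluster keeps its old centroid.
    """
    order = sorted(range(len(points)), key=labels.__getitem__)
    new_centroids = [list(c) for c in centroids]
    i = 0
    while i < len(order):
        k = labels[order[i]]
        acc = [0.0] * len(points[order[i]])
        j = i
        while j < len(order) and labels[order[j]] == k:  # walk one segment
            for dim, value in enumerate(points[order[j]]):
                acc[dim] += value
            j += 1
        new_centroids[k] = [s / (j - i) for s in acc]  # segment mean
        i = j
    return new_centroids


# Small usage example: two clear groups and one centroid that attracts no points.
demo_points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
demo_centroids = [[1.0, 1.0], [9.0, 9.0], [100.0, 100.0]]
demo_labels = assign_online_argmin(demo_points, demo_centroids)
demo_updated = update_sorted_segments(demo_points, demo_labels, demo_centroids)
```

On a GPU the same structure would presumably map to a fused kernel for the assignment and a sort followed by segmented reductions for the update, but the sketch above makes no claim about how flash-kmeans schedules either step.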
