Flash-KMeans: Fast and Memory-Efficient Exact K-Means

Abstract

k-means has historically been positioned primarily as an offline processing primitive, typically used for dataset organization or embedding preprocessing rather than as a first-class component in online systems. In this work, we revisit this classical algorithm under the lens of modern AI system design and enable k-means as an online primitive. We point out that existing GPU implementations of k-means remain fundamentally bottlenecked by low-level system constraints rather than theoretical algorithmic complexity. Specifically, the assignment stage suffers from a severe IO bottleneck due to the massive explicit materialization of the N × K distance matrix in High Bandwidth Memory (HBM). Simultaneously, the centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations. To bridge this performance gap, we propose flash-kmeans, an IO-aware and contention-free k-means implementation for modern GPU workloads. Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization; (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions. Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability. Extensive evaluations on NVIDIA H200 GPUs demonstrate that flash-kmeans achieves up to 17.9× end-to-end speedup over the best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33× and over 200×, respectively.
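The two kernel ideas named in the abstract can be illustrated with a minimal CPU sketch in NumPy. This is not the authors' GPU implementation: the function names `assign_online_argmin` and `update_sorted_segments` and the `tile` parameter are hypothetical, chosen only for illustration. The first function keeps a running minimum over centroid tiles so the full N × K distance matrix is never stored. The second sorts points by label so each centroid is computed from one contiguous segment instead of scattered accumulations.

```python
import numpy as np


def assign_online_argmin(x, centroids, tile=256):
    """Nearest-centroid labels without storing the full N x K distance matrix.

    Illustrative only: processes centroids in tiles and keeps a running
    (best distance, best index) pair per point.
    """
    n = x.shape[0]
    best_dist = np.full(n, np.inf)
    best_idx = np.zeros(n, dtype=np.int64)
    x_sq = (x * x).sum(axis=1)
    rows = np.arange(n)
    for start in range(0, centroids.shape[0], tile):
        c = centroids[start:start + tile]
        # Squared distances to this tile only, shape (N, tile_size).
        d = x_sq[:, None] - 2.0 * (x @ c.T) + (c * c).sum(axis=1)[None, :]
        local_idx = d.argmin(axis=1)
        local_dist = d[rows, local_idx]
        # Online argmin: overwrite only where this tile is strictly better.
        better = local_dist < best_dist
        best_dist[better] = local_dist[better]
        best_idx[better] = start + local_idx[better]
    return best_idx


def update_sorted_segments(x, labels, centroids):
    """Centroid update via sort plus per-segment reduction.

    Illustrative only: sorting by label makes each cluster's points
    contiguous, so every centroid is one reduction over one segment.
    Empty clusters keep their previous centroid (an assumption of this sketch).
    """
    order = np.argsort(labels, kind="stable")  # sorted position -> point index
    sorted_labels = labels[order]
    present, starts = np.unique(sorted_labels, return_index=True)
    sums = np.add.reduceat(x[order], starts, axis=0)
    counts = np.diff(np.append(starts, labels.shape[0]))
    new_centroids = centroids.copy()
    new_centroids[present] = sums / counts[:, None]
    return new_centroids
```

One Lloyd iteration under these assumptions is `labels = assign_online_argmin(x, centroids)` followed by `centroids = update_sorted_segments(x, labels, centroids)`. Peak extra memory in the assignment step scales with N × `tile` rather than N × K, which is the property the fused kernel exploits on the GPU.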
