Flash-GMM: A Memory-Efficient Kernel for Scalable Soft Clustering

June 9, 2026
Authors: Gal Bloch, Ariel Gera, Matan Orbach, Ohad Eytan, Assaf Toledo
cs.AI

Abstract

We present Flash-GMM, a fused Triton kernel for efficient computation of Gaussian Mixture Models (GMMs) over large-scale data in a single GPU pass. By eliminating the need to materialize the full responsibility matrix in GPU memory, Flash-GMM achieves a 20× speedup over existing implementations and enables training on datasets more than 100× larger than previously feasible on one device. To demonstrate its impact, we integrate Flash-GMM into the IVF coarse quantizer for approximate nearest-neighbor (ANN) search. We show that soft GMM clustering is now a viable drop-in replacement for k-means, and that GMM responsibilities can be leveraged to assign border vectors to multiple clusters. Our approach reaches fixed recall targets with up to 1.7× fewer distance computations, or equivalently, yields +2 to 12 points of recall@10 at matched computational cost. We release the kernel as an open-source project.
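
To make the memory argument concrete, the following is a minimal NumPy sketch of one EM iteration that never holds the full N×K responsibility matrix. It is not the Flash-GMM Triton kernel: the function names (`log_joint`, `streaming_em_step`), the diagonal-covariance model, the `chunk_size` tiling, and the `var_floor` guard are all assumptions made for illustration. The idea it shows is that responsibilities are needed only to accumulate per-component sufficient statistics, so each chunk's responsibilities can be computed, reduced, and discarded.

```python
import numpy as np


def log_joint(X, weights, means, variances):
    """log(pi_k) + log N(x | mu_k, diag(var_k)) for one chunk; shape (B, K)."""
    D = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]                # (B, K, D)
    maha = np.sum(diff * diff / variances[None], axis=2)    # (B, K)
    log_det = np.sum(np.log(variances), axis=1)             # (K,)
    return np.log(weights) - 0.5 * (D * np.log(2 * np.pi) + log_det + maha)


def streaming_em_step(X, weights, means, variances,
                      chunk_size=1024, var_floor=1e-6):
    """One EM iteration using only O(chunk_size * K) responsibility memory.

    Assumes diagonal covariances (an illustrative choice, not taken from
    the paper) and that every component receives nonzero total mass.
    """
    N, D = X.shape
    K = weights.shape[0]
    Nk = np.zeros(K)         # sum_n r_nk
    S1 = np.zeros((K, D))    # sum_n r_nk * x_n
    S2 = np.zeros((K, D))    # sum_n r_nk * x_n**2
    log_lik = 0.0

    for start in range(0, N, chunk_size):
        Xc = X[start:start + chunk_size]
        lj = log_joint(Xc, weights, means, variances)
        # Numerically stable log-sum-exp over components.
        m = lj.max(axis=1, keepdims=True)
        lse = m + np.log(np.exp(lj - m).sum(axis=1, keepdims=True))
        R = np.exp(lj - lse)  # responsibilities for this chunk only
        # Fold the chunk into the running statistics, then drop R.
        Nk += R.sum(axis=0)
        S1 += R.T @ Xc
        S2 += R.T @ (Xc * Xc)
        log_lik += float(lse.sum())

    # M-step from the accumulated statistics.
    new_weights = Nk / N
    new_means = S1 / Nk[:, None]
    new_vars = np.maximum(S2 / Nk[:, None] - new_means ** 2, var_floor)
    return new_weights, new_means, new_vars, log_lik
```

Because the accumulated statistics are plain sums over data points, the result does not depend on `chunk_size` up to floating-point rounding; setting `chunk_size` to N recovers the conventional approach that materializes every responsibility at once.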