Flash-GMM: A Memory-Efficient Kernel for Scalable Soft Clustering

Abstract

We present Flash-GMM, a fused Triton kernel for efficient computation of Gaussian Mixture Models (GMMs) over large-scale data in a single GPU pass. By eliminating the need to materialize the full responsibility matrix in GPU memory, Flash-GMM achieves a 20× speedup over existing implementations and enables training on datasets more than 100× larger than previously feasible on one device. To demonstrate its impact, we integrate Flash-GMM into the IVF coarse quantizer for approximate nearest-neighbor (ANN) search. We show that soft GMM clustering is now a viable drop-in replacement for k-means, and that GMM responsibilities can be used to assign border vectors to multiple clusters. Our approach reaches fixed recall targets with up to 1.7× fewer distance computations or, at matched computational cost, improves recall@10 by 2 to 12 points. We release the kernel as an open-source project.
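
To make the memory argument concrete, the following is a minimal NumPy sketch of one EM step that never stores the full N-by-K responsibility matrix. It is not the Flash-GMM Triton kernel: it assumes diagonal covariances, processes the data in chunks on the CPU, and keeps only the per-component sufficient statistics. The names `em_step_chunked` and `multi_assign` are hypothetical, as are the chunk size and the 0.2 responsibility threshold used for multi-assignment, which the abstract does not specify.

```python
import numpy as np


def log_joint_diag(x, means, variances, log_weights):
    """Return log(weight_k * N(x_i | mean_k, diag(var_k))) with shape (B, K)."""
    d = x.shape[1]
    diff = x[:, None, :] - means[None, :, :]                 # (B, K, D)
    maha = np.sum(diff * diff / variances[None], axis=2)     # (B, K)
    log_det = np.sum(np.log(variances), axis=1)              # (K,)
    return log_weights[None] - 0.5 * (d * np.log(2 * np.pi) + log_det[None] + maha)


def em_step_chunked(x, means, variances, weights, chunk=1024):
    """One EM step; responsibilities exist only for one chunk at a time."""
    n, d = x.shape
    k = means.shape[0]
    nk = np.zeros(k)          # sum of responsibilities per component
    s1 = np.zeros((k, d))     # responsibility-weighted sum of x
    s2 = np.zeros((k, d))     # responsibility-weighted sum of x**2
    log_w = np.log(weights)
    for start in range(0, n, chunk):
        xb = x[start:start + chunk]
        logp = log_joint_diag(xb, means, variances, log_w)
        logp -= logp.max(axis=1, keepdims=True)              # stabilize the softmax
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)                    # (B, K) chunk responsibilities
        nk += r.sum(axis=0)
        s1 += r.T @ xb
        s2 += r.T @ (xb * xb)
    nk_safe = np.maximum(nk, 1e-12)                          # guard against empty components
    new_means = s1 / nk_safe[:, None]
    new_vars = s2 / nk_safe[:, None] - new_means ** 2 + 1e-6  # small variance floor
    return new_means, new_vars, nk / n


def multi_assign(resp, threshold=0.2):
    """Assumed rule: list every cluster whose responsibility meets the threshold."""
    top = resp.argmax(axis=1)
    out = []
    for i in range(resp.shape[0]):
        ids = np.flatnonzero(resp[i] >= threshold)
        out.append(ids if ids.size else np.array([top[i]]))
    return out
```

Peak memory for responsibilities in this sketch is proportional to `chunk * K` rather than `N * K`. A fused GPU kernel would apply the same idea at the tile level, but that mapping is not shown here.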