Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss
October 22, 2024
Authors: Zesen Cheng, Hang Zhang, Kehan Li, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing
cs.AI
Abstract
Contrastive loss is a powerful approach for representation learning, where
larger batch sizes enhance performance by providing more negative samples to
better distinguish between similar and dissimilar data. However, scaling batch
sizes is constrained by the quadratic growth in GPU memory consumption,
primarily due to the full instantiation of the similarity matrix. To address
this, we propose a tile-based computation strategy that partitions the
contrastive loss calculation into arbitrary small blocks, avoiding full
materialization of the similarity matrix. Furthermore, we introduce a
multi-level tiling strategy to leverage the hierarchical structure of
distributed systems, employing ring-based communication at the GPU level to
optimize synchronization and fused kernels at the CUDA core level to reduce I/O
overhead. Experimental results show that the proposed method scales batch sizes
to unprecedented levels. For instance, it enables contrastive training of a
CLIP-ViT-L/14 model with a batch size of 4M or 12M using 8 or 32 A800 80GB GPUs
without sacrificing any accuracy. Compared to SOTA memory-efficient solutions,
it achieves a two-order-of-magnitude reduction in memory while maintaining
comparable speed. The code will be made publicly available.
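
The snippet below is a minimal, single-GPU sketch of the tile-based idea described in the abstract: the image-to-text InfoNCE loss is accumulated tile by tile with a streaming log-sum-exp, so only a (B, tile_size) block of the similarity matrix is materialized at any time. The function name, `tile_size`, and `temperature` values are illustrative assumptions, not the authors' released implementation, and the sketch omits the paper's multi-GPU ring-based communication and fused CUDA kernels.

```python
# Illustrative sketch only: tile-based contrastive (InfoNCE) loss that avoids
# instantiating the full B x B similarity matrix in the forward pass.
import torch
import torch.nn.functional as F


def tiled_infonce_loss(img, txt, temperature=0.07, tile_size=1024):
    """Image-to-text InfoNCE loss computed over column tiles of the
    similarity matrix; only a (B, tile) block exists at any time."""
    B = img.shape[0]
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)

    # Running statistics for a numerically stable streaming log-sum-exp.
    running_max = torch.full((B,), float("-inf"), device=img.device)
    running_sum = torch.zeros(B, device=img.device)
    pos_logit = torch.empty(B, device=img.device)

    for start in range(0, B, tile_size):
        end = min(start + tile_size, B)
        # Only this (B, tile) block of the similarity matrix is materialized.
        block = img @ txt[start:end].T / temperature  # shape (B, end - start)

        # Record the positive (diagonal) logits that fall inside this tile.
        rows = torch.arange(start, end, device=img.device)
        pos_logit[rows] = block[rows, rows - start]

        # Streaming log-sum-exp update: rescale the old sum to the new max.
        block_max = block.max(dim=1).values
        new_max = torch.maximum(running_max, block_max)
        running_sum = running_sum * torch.exp(running_max - new_max) \
            + torch.exp(block - new_max.unsqueeze(1)).sum(dim=1)
        running_max = new_max

    log_denominator = running_max + torch.log(running_sum)
    return (log_denominator - pos_logit).mean()


if __name__ == "__main__":
    img_feats = torch.randn(4096, 512)
    txt_feats = torch.randn(4096, 512)
    print(tiled_infonce_loss(img_feats, txt_feats).item())
```

Note that this sketch only bounds forward-pass memory; autograd would still retain the per-tile activations for the backward pass unless tiles are recomputed (e.g., via checkpointing or the fused kernels the paper describes).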