InfoNCE Induces Gaussian Distribution

Abstract

Contrastive learning has become a cornerstone of modern representation learning, enabling both task-specific and general (foundation) models to be trained on massive unlabeled data. The prototypical loss in contrastive training is InfoNCE, together with its variants. In this work, we show that the InfoNCE objective induces Gaussian structure in the representations that emerge from contrastive training. We establish this result in two complementary regimes. First, under certain alignment and concentration assumptions, we show that projections of the high-dimensional representation asymptotically approach a multivariate Gaussian distribution. Second, under less strict assumptions, we show that adding a small, asymptotically vanishing regularization term that promotes low feature norm and high feature entropy leads to similar asymptotic results. We support the analysis with experiments on synthetic and CIFAR-10 datasets across multiple encoder architectures and sizes, which show consistent Gaussian behavior. This perspective offers a principled explanation for the Gaussianity commonly observed in contrastive representations. The resulting Gaussian model enables principled analytical treatment of learned representations and is expected to support a wide range of applications in contrastive learning.
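For readers unfamiliar with the objective, the following is a minimal NumPy sketch of a commonly used batch form of InfoNCE. The abstract does not state the exact variant analyzed, so the cosine similarity, the temperature value, and the function name `info_nce` are assumptions made here for illustration only.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Batch InfoNCE: row i of z_a and row i of z_b form the positive pair.

    Assumed form (not taken from the paper): cosine similarity with a
    fixed temperature, all other rows in the batch acting as negatives.
    """
    # L2-normalize so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (n, n) similarity matrix
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal entries (positive pairs) as targets.
    return -np.mean(np.diag(log_prob))

# Illustrative usage on random features (hypothetical data).
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_aligned = info_nce(z, z)                          # positives coincide
loss_random = info_nce(z, rng.normal(size=(8, 16)))    # positives unrelated
```

In this sketch the loss is small when each positive pair is far more similar than the in-batch negatives, and it equals log(n) when all n similarities in a row are identical.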