InfoNCE Induces Gaussian Distribution
February 27, 2026
Authors: Roy Betser, Eyal Gofer, Meir Yossef Levi, Guy Gilboa
cs.AI
Abstract
Contrastive learning has become a cornerstone of modern representation learning, allowing training with massive unlabeled data for both task-specific and general (foundation) models. InfoNCE and its variants are the prototypical losses in contrastive training. In this work, we show that the InfoNCE objective induces Gaussian structure in the representations that emerge from contrastive training. We establish this result in two complementary regimes. First, under certain alignment and concentration assumptions, projections of the high-dimensional representation asymptotically approach a multivariate Gaussian distribution. Second, under weaker assumptions, we show that adding a small, asymptotically vanishing regularization term that promotes low feature norm and high feature entropy leads to similar asymptotic results. We support our analysis with experiments on synthetic and CIFAR-10 datasets across multiple encoder architectures and sizes, demonstrating consistent Gaussian behavior. This perspective provides a principled explanation for the Gaussianity commonly observed in contrastive representations. The resulting Gaussian model enables principled analytical treatment of learned representations and is expected to support a wide range of applications in contrastive learning.
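For context, the InfoNCE loss in its standard form (as commonly written in the contrastive-learning literature; the paper's exact variant may differ) scores a positive pair against N in-batch candidates under a temperature τ:

```latex
\mathcal{L}_{\text{InfoNCE}}
= -\,\mathbb{E}\!\left[
    \log \frac{\exp\!\big(\mathrm{sim}(z_i, z_i^{+})/\tau\big)}
              {\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}
  \right],
```

where sim(·,·) is typically cosine similarity and τ > 0 is a temperature hyperparameter.

As a minimal, illustrative sketch of the Gaussianity claim (not the authors' experimental protocol), one can project unit-norm embeddings onto random directions and test the one-dimensional projections for normality; the synthetic unit-sphere vectors below stand in for embeddings from a trained encoder:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Stand-in for learned contrastive embeddings: synthetic unit-norm
# vectors drawn uniformly on the sphere (a hypothetical placeholder;
# in practice Z would come from a trained encoder).
d, n = 512, 10_000
Z = rng.standard_normal((n, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)  # L2-normalize, as is common with InfoNCE

# Project onto a few random unit directions and test each 1-D
# projection for normality (D'Agostino-Pearson test).
for k in range(3):
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    proj = Z @ v * np.sqrt(d)  # rescale so the projection's variance is O(1)
    stat, p = stats.normaltest(proj)
    print(f"direction {k}: normaltest p-value = {p:.3f}")
```

High p-values indicate the projections are statistically indistinguishable from Gaussian, the qualitative behavior the paper attributes to representations trained with the InfoNCE objective.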