Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models
February 27, 2026
Authors: Arnas Uselis, Andrea Dittadi, Seong Joon Oh
cs.AI
Abstract
Compositional generalization, the ability to recognize familiar parts in novel contexts, is a defining property of intelligent systems. Although modern models are trained on massive datasets, they still cover only a tiny fraction of the combinatorial space of possible inputs, raising the question of what structure representations must have to support generalization to unseen combinations. We formalize three desiderata for compositional generalization under standard training (divisibility, transferability, stability) and show they impose necessary geometric constraints: representations must decompose linearly into per-concept components, and these components must be orthogonal across concepts. This provides theoretical grounding for the Linear Representation Hypothesis: the linear structure widely observed in neural representations is a necessary consequence of compositional generalization. We further derive dimension bounds linking the number of composable concepts to the embedding geometry. Empirically, we evaluate these predictions across modern vision models (CLIP, SigLIP, DINO) and find that representations exhibit partial linear factorization with low-rank, near-orthogonal per-concept factors, and that the degree of this structure correlates with compositional generalization on unseen combinations. As models continue to scale, these conditions predict the representational geometry they may converge to. Code is available at https://github.com/oshapio/necessary-compositionality.
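The abstract's central claim, that compositional generalization forces embeddings to decompose linearly into near-orthogonal per-concept components, can be illustrated with a small synthetic sketch. The setup below is hypothetical and not the paper's actual evaluation: it fabricates additive two-concept embeddings (color + shape + noise), estimates per-concept difference directions from seen combinations, and checks that (a) the concept directions are near-orthogonal and (b) the color offset transfers to predict an unseen combination.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # embedding dimension (arbitrary choice for the sketch)

# Hypothetical additive generative model: z = v_color + v_shape + noise.
v_red, v_blue = rng.normal(size=d), rng.normal(size=d)
v_circle, v_square = rng.normal(size=d), rng.normal(size=d)

def embed(color_vec, shape_vec, n=200, noise=0.05):
    """Sample n noisy embeddings for one (color, shape) combination."""
    return color_vec + shape_vec + noise * rng.normal(size=(n, d))

# Mean embeddings for three "seen" combinations.
red_circle = embed(v_red, v_circle).mean(axis=0)
blue_circle = embed(v_blue, v_circle).mean(axis=0)
red_square = embed(v_red, v_square).mean(axis=0)

# Per-concept directions recovered as mean differences.
color_dir = red_circle - blue_circle  # isolates red vs. blue
shape_dir = red_circle - red_square   # isolates circle vs. square

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Transferability: the same color offset should predict the held-out
# (blue, square) combination from the seen (red, square) one.
predicted_blue_square = red_square - color_dir
actual_blue_square = embed(v_blue, v_square).mean(axis=0)
alignment = cosine(predicted_blue_square, actual_blue_square)

# Orthogonality: concept directions should be near-perpendicular.
cross_concept = abs(cosine(color_dir, shape_dir))

print(f"alignment with unseen combination: {alignment:.3f}")
print(f"|cos| between concept directions:  {cross_concept:.3f}")
```

Under this additive model, the alignment score approaches 1 while the cross-concept cosine stays small, matching the geometry the abstract describes; in real models (CLIP, SigLIP, DINO) the paper reports only partial versions of this structure.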