

Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models

February 27, 2026
Authors: Arnas Uselis, Andrea Dittadi, Seong Joon Oh
cs.AI

Abstract

Compositional generalization, the ability to recognize familiar parts in novel contexts, is a defining property of intelligent systems. Although modern models are trained on massive datasets, they still cover only a tiny fraction of the combinatorial space of possible inputs, raising the question of what structure representations must have to support generalization to unseen combinations. We formalize three desiderata for compositional generalization under standard training (divisibility, transferability, stability) and show they impose necessary geometric constraints: representations must decompose linearly into per-concept components, and these components must be orthogonal across concepts. This provides theoretical grounding for the Linear Representation Hypothesis: the linear structure widely observed in neural representations is a necessary consequence of compositional generalization. We further derive dimension bounds linking the number of composable concepts to the embedding geometry. Empirically, we evaluate these predictions across modern vision models (CLIP, SigLIP, DINO) and find that representations exhibit partial linear factorization with low-rank, near-orthogonal per-concept factors, and that the degree of this structure correlates with compositional generalization on unseen combinations. As models continue to scale, these conditions predict the representational geometry they may converge to. Code is available at https://github.com/oshapio/necessary-compositionality.
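The abstract's central claim is that compositional generalization forces embeddings to decompose linearly into per-concept components, so that an unseen combination can be predicted by summing the components of its familiar parts. The following is a minimal synthetic sketch of that idea, not the paper's actual method: it generates hypothetical embeddings as sums of per-concept vectors for two concept axes (e.g., color and shape), recovers the components by least squares from all but one combination, and checks that the held-out combination is reconstructed compositionally.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                  # embedding dimension (arbitrary choice)
colors, shapes = 4, 5   # two hypothetical concept axes

# Assumed ground truth for the sketch: each (color, shape) embedding is the
# sum of a color component and a shape component, plus small noise. Random
# Gaussian components in high dimension are also near-orthogonal, loosely
# mirroring the orthogonality condition described in the abstract.
C = rng.normal(size=(colors, d))
S = rng.normal(size=(shapes, d))

# Observe embeddings for every combination EXCEPT one held-out pair.
held_out = (colors - 1, shapes - 1)
pairs = [(i, j) for i in range(colors) for j in range(shapes)
         if (i, j) != held_out]
E = np.stack([C[i] + S[j] + 0.01 * rng.normal(size=d) for i, j in pairs])

# One-hot design matrix over concept values; least squares recovers the
# per-concept components (up to a shared offset that cancels in sums).
X = np.zeros((len(pairs), colors + shapes))
for r, (i, j) in enumerate(pairs):
    X[r, i] = 1.0
    X[r, colors + j] = 1.0
W, *_ = np.linalg.lstsq(X, E, rcond=None)
C_hat, S_hat = W[:colors], W[colors:]

# Compositional prediction for the unseen combination: sum the parts.
pred = C_hat[held_out[0]] + S_hat[held_out[1]]
true = C[held_out[0]] + S[held_out[1]]
err = np.linalg.norm(pred - true) / np.linalg.norm(true)
print(f"relative error on unseen combination: {err:.3f}")
```

Under these assumptions the relative error stays near the noise level, illustrating how a linear, factorized geometry supports generalization to combinations never seen during fitting; real vision embeddings, per the abstract, exhibit this structure only partially.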