Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models

Abstract

Compositional generalization, the ability to recognize familiar parts in novel contexts, is a defining property of intelligent systems. Although modern models are trained on massive datasets, they still cover only a tiny fraction of the combinatorial space of possible inputs, raising the question of what structure representations must have to support generalization to unseen combinations. We formalize three desiderata for compositional generalization under standard training (divisibility, transferability, stability) and show they impose necessary geometric constraints: representations must decompose linearly into per-concept components, and these components must be orthogonal across concepts. This provides theoretical grounding for the Linear Representation Hypothesis: the linear structure widely observed in neural representations is a necessary consequence of compositional generalization. We further derive dimension bounds linking the number of composable concepts to the embedding geometry. Empirically, we evaluate these predictions across modern vision models (CLIP, SigLIP, DINO) and find that representations exhibit partial linear factorization with low-rank, near-orthogonal per-concept factors, and that the degree of this structure correlates with compositional generalization on unseen combinations. As models continue to scale, these conditions predict the representational geometry they may converge to. Code is available at https://github.com/oshapio/necessary-compositionality.
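The geometric claim (embeddings that split into additive per-concept components lying in mutually orthogonal directions) can be illustrated on synthetic data. The sketch below is not the paper's code or evaluation protocol: the function names `additive_fit` and `max_cross_cosine`, the dimensions, and the use of marginal means as per-concept factors are all assumptions made for illustration.

```python
import numpy as np

def additive_fit(Z):
    """Fit an additive (linear) two-concept model to embeddings.

    Z has shape (n_a, n_b, d): one embedding per (a, b) combination.
    Returns the global mean, the per-value factors of each concept, and
    the fraction of variance explained by mean + factor_a + factor_b.
    """
    mu = Z.mean(axis=(0, 1))            # global mean, shape (d,)
    fa = Z.mean(axis=1) - mu            # concept-A factors, shape (n_a, d)
    fb = Z.mean(axis=0) - mu            # concept-B factors, shape (n_b, d)
    recon = mu + fa[:, None, :] + fb[None, :, :]
    resid = ((Z - recon) ** 2).sum()
    total = ((Z - mu) ** 2).sum()
    return mu, fa, fb, 1.0 - resid / total

def max_cross_cosine(fa, fb):
    """Largest |cosine| between any concept-A and any concept-B factor."""
    na = fa / np.linalg.norm(fa, axis=1, keepdims=True)
    nb = fb / np.linalg.norm(fb, axis=1, keepdims=True)
    return float(np.abs(na @ nb.T).max())

# Hypothetical toy setup: two concepts whose components live in
# orthogonal subspaces of a 16-dimensional embedding space.
rng = np.random.default_rng(0)
d, n_a, n_b = 16, 4, 5
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))    # orthonormal basis of R^d
U = rng.normal(size=(n_a, 4)) @ Q[:, :4].T      # concept-A components
V = rng.normal(size=(n_b, 5)) @ Q[:, 4:9].T     # concept-B components
Z = U[:, None, :] + V[None, :, :]               # purely additive embeddings

_, fa, fb, r2 = additive_fit(Z)
cross = max_cross_cosine(fa, fb)

# Adding a non-additive perturbation lowers the additive fit.
Z_noisy = Z + 0.5 * rng.normal(size=Z.shape)
_, _, _, r2_noisy = additive_fit(Z_noisy)

print(f"additive R^2: {r2:.4f}, max cross-concept |cos|: {cross:.2e}")
print(f"additive R^2 with interaction noise: {r2_noisy:.4f}")
```

On real model embeddings, the same two quantities (how much variance an additive fit explains, and how close the cross-concept cosines are to zero) give a rough reading of how linear and how orthogonal the representation is, although the paper's own metrics may be defined differently.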

