Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models

Abstract

Compositional generalization, the ability to recognize familiar parts in novel contexts, is a defining property of intelligent systems. Although modern models are trained on massive datasets, they still cover only a tiny fraction of the combinatorial space of possible inputs, raising the question of what structure representations must have to support generalization to unseen combinations. We formalize three desiderata for compositional generalization under standard training (divisibility, transferability, stability) and show they impose necessary geometric constraints: representations must decompose linearly into per-concept components, and these components must be orthogonal across concepts. This provides theoretical grounding for the Linear Representation Hypothesis: the linear structure widely observed in neural representations is a necessary consequence of compositional generalization. We further derive dimension bounds linking the number of composable concepts to the embedding geometry. Empirically, we evaluate these predictions across modern vision models (CLIP, SigLIP, DINO) and find that representations exhibit partial linear factorization with low-rank, near-orthogonal per-concept factors, and that the degree of this structure correlates with compositional generalization on unseen combinations. As models continue to scale, these conditions predict the representational geometry they may converge to. Code is available at https://github.com/oshapio/necessary-compositionality.
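
The two geometric conditions named in the abstract (linear decomposition into per-concept components, and orthogonality of those components across concepts) can be illustrated with a small numerical sketch. Everything below is an assumption made for illustration and is not taken from the paper or its repository: the two-concept grid, the synthetic embeddings, the mean-based additive fit, and the names `additive_fit` and `max_abs_cosine` are all hypothetical.

```python
import numpy as np

def additive_fit(Z):
    """Fit Z[i, j] ~ mu + U[i] + V[j] by averaging (illustrative, not the paper's method).

    Z has shape (n_a, n_b, d): one embedding per pair of concept values.
    Returns mu, the per-value factors U and V, and the fraction of
    variance explained by the additive (linear) model.
    """
    mu = Z.mean(axis=(0, 1))          # global mean, shape (d,)
    U = Z.mean(axis=1) - mu           # factor per value of concept A, (n_a, d)
    V = Z.mean(axis=0) - mu           # factor per value of concept B, (n_b, d)
    Z_hat = mu + U[:, None, :] + V[None, :, :]
    resid = ((Z - Z_hat) ** 2).sum()
    total = ((Z - mu) ** 2).sum()
    return mu, U, V, 1.0 - resid / total

def max_abs_cosine(U, V):
    """Largest absolute cosine between any factor in U and any factor in V."""
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    return float(np.abs(Un @ Vn.T).max())

# Synthetic stand-in for model embeddings (assumption): each concept lives
# in its own low-rank subspace, and the two subspaces are orthogonal.
rng = np.random.default_rng(0)
d, n_a, n_b = 64, 4, 5
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))          # orthonormal basis
U_true = rng.normal(size=(n_a, 3)) @ Q[:, :3].T        # concept A: rank 3
V_true = rng.normal(size=(n_b, 4)) @ Q[:, 3:7].T       # concept B: rank 4
noise = 1e-3 * rng.normal(size=(n_a, n_b, d))
Z = U_true[:, None, :] + V_true[None, :, :] + noise

mu, U, V, r2 = additive_fit(Z)
cos = max_abs_cosine(U, V)
print(f"variance explained by additive model: {r2:.6f}")
print(f"max |cosine| between concept factors: {cos:.6f}")
```

On this synthetic grid the additive model explains almost all of the variance and the cross-concept cosines are close to zero, which is the idealized geometry the abstract describes. On real model embeddings one would expect only partial factorization, consistent with the abstract's empirical claim.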
