How Can Embedding Models Bind Concepts?

Abstract

Humans easily determine which color belongs to which shape in multi-object scenes, an ability known as concept binding. Vision-language embedding models such as CLIP struggle with binding: they recognize individual concepts but fail to represent which concepts form which objects. Although CLIP behaves like a bag-of-concepts model in cross-modal retrieval, object information can be recovered from its image embeddings and from its text embeddings separately. We study this tension through the binding function, which maps concepts to scene embeddings. We find that scene embeddings decompose additively into object representations, which explains why unimodal probes can recover object information. However, CLIP's binding function has high complexity, which likely prevents the image and text encoders from learning a shared binding mechanism that generalizes to unseen concept combinations. We then ask whether this limitation is fundamental and show that it is not: in controlled transformer models trained from scratch, binding generalization emerges with sufficient data coverage. These models learn low-complexity binding functions characterized by multiplicative interactions between concepts, which enables systematic generalization. Code is publicly available at https://github.com/oshapio/binding-concepts-complexity.
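
To make the contrast between a bag-of-concepts embedding and a multiplicative binding function concrete, the following is a minimal toy sketch. It is not the code from the repository above and does not reproduce CLIP or the trained transformers. The random concept vectors, the elementwise product used as the binding operation, and the names `bag_of_concepts` and `multiplicative_binding` are all illustrative assumptions. In both variants the scene embedding is a sum over objects, mirroring the additive decomposition described above.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64  # embedding width (arbitrary choice for this toy example)

# Hypothetical concept vectors: one random vector per color and per shape.
colors = {name: rng.standard_normal(dim) for name in ("red", "blue")}
shapes = {name: rng.standard_normal(dim) for name in ("square", "circle")}


def bag_of_concepts(scene):
    """Sum all concept vectors; which color goes with which shape is lost."""
    return sum(colors[c] + shapes[s] for c, s in scene)


def multiplicative_binding(scene):
    """Sum per-object vectors, each an elementwise product of color and shape.

    The elementwise product is an assumed stand-in for a multiplicative
    interaction between concepts; the scene is still additive over objects.
    """
    return sum(colors[c] * shapes[s] for c, s in scene)


# Two scenes with the same concepts but swapped color-shape pairings.
scene_a = [("red", "square"), ("blue", "circle")]
scene_b = [("blue", "square"), ("red", "circle")]

bag_gap = np.linalg.norm(bag_of_concepts(scene_a) - bag_of_concepts(scene_b))
bound_gap = np.linalg.norm(
    multiplicative_binding(scene_a) - multiplicative_binding(scene_b)
)

print(f"bag-of-concepts distance between scenes: {bag_gap:.3e}")
print(f"multiplicative binding distance between scenes: {bound_gap:.3e}")
```

In this sketch the bag-of-concepts embeddings of the two scenes coincide up to floating-point rounding, because both scenes contain the same four concepts, while the multiplicatively bound embeddings differ by the elementwise product of (red − blue) and (square − circle). This illustrates only why a multiplicative interaction can separate swapped pairings. It says nothing about the complexity of the binding function that any real model learns.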