How Do Embedding Models Bind Concepts?

Abstract

Humans easily determine which color belongs to which shape in multi-object scenes, an ability known as concept binding. Vision-language embedding models such as CLIP struggle with binding: they recognize individual concepts but fail to represent which concepts form which objects. Although CLIP behaves like a bag-of-concepts model in cross-modal retrieval, object information is recoverable from its image embeddings and its text embeddings when each modality is probed separately. We study this tension through the binding function, which maps concepts to scene embeddings. We find that scene embeddings decompose additively into object representations, which explains why uni-modal probes can recover object information. However, CLIP's binding function has high complexity, which likely prevents the image and text encoders from learning a shared binding mechanism that generalizes to unseen concept combinations. We then ask whether this limitation is fundamental and show that it is not: in controlled transformer models trained from scratch, binding generalization emerges with sufficient data coverage. These models learn low-complexity binding functions characterized by multiplicative interactions between concepts, which enable systematic generalization. Code is publicly available at https://github.com/oshapio/binding-concepts-complexity.
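The contrast between a bag-of-concepts embedding and an embedding that sums per-object representations can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the concept vectors are random, the function names are hypothetical, and the elementwise product is only one simple stand-in for a "multiplicative interaction". It is not the binding function the paper measures in CLIP or in the trained transformers.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Hypothetical random concept vectors; not taken from CLIP or the paper's models.
colors = {name: rng.standard_normal(dim) for name in ("red", "blue")}
shapes = {name: rng.standard_normal(dim) for name in ("cube", "sphere")}


def bag_of_concepts(scene):
    """Sum every concept vector independently, discarding object membership."""
    return sum(colors[c] + shapes[s] for c, s in scene)


def multiplicative_binding(scene):
    """Sum per-object vectors; each object is the elementwise product of its concepts."""
    return sum(colors[c] * shapes[s] for c, s in scene)


# Two scenes with the same concepts but swapped color-shape assignments.
scene = [("red", "cube"), ("blue", "sphere")]
swapped = [("blue", "cube"), ("red", "sphere")]

# Distance between the two scene embeddings under each scheme.
bag_gap = np.linalg.norm(bag_of_concepts(scene) - bag_of_concepts(swapped))
bound_gap = np.linalg.norm(
    multiplicative_binding(scene) - multiplicative_binding(swapped)
)

print(f"bag-of-concepts gap:        {bag_gap:.3e}")
print(f"multiplicative binding gap: {bound_gap:.3e}")
```

In this toy setting the bag-of-concepts embeddings of the two scenes coincide up to floating-point rounding, so no readout could tell the scenes apart. The second scheme is still additive over objects, but each object term couples its color and shape, so the swapped scene lands at a different point.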