How Do Embedding Models Bind Concepts?

Abstract

Humans easily determine which color belongs to which shape in multi-object scenes, an ability known as concept binding. Vision-language embedding models such as CLIP struggle with binding: they recognize individual concepts but fail to represent which concepts form which objects. Although CLIP behaves like a bag-of-concepts model in cross-modal retrieval, object information is recoverable from its image and text embeddings separately. We study this tension through the binding function, which maps concepts to scene embeddings. We find that scene embeddings decompose additively into object representations, which explains why unimodal probes can recover object information. However, CLIP's binding function has high complexity, which likely prevents the image and text encoders from learning a shared binding mechanism that generalizes to unseen concept combinations. We then ask whether this limitation is fundamental, and show that it is not. In controlled transformer models trained from scratch, binding generalization emerges with sufficient data coverage. These models learn low-complexity binding functions characterized by multiplicative interactions between concepts, enabling systematic generalization. Code is publicly available at https://github.com/oshapio/binding-concepts-complexity.
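
To make the distinction between a bag-of-concepts representation and multiplicative binding concrete, the following is a minimal toy sketch in NumPy. It is not the released code and does not model CLIP's actual binding function. It assumes random concept vectors and two hypothetical binding functions, `bind_additive` and `bind_multiplicative`, and it builds a scene embedding as a sum of object representations, mirroring the additive decomposition described above. All names and dimensions are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy setup: random vectors stand in for learned concept embeddings.
rng = np.random.default_rng(0)
dim = 16
colors = {name: rng.normal(size=dim) for name in ("red", "blue")}
shapes = {name: rng.normal(size=dim) for name in ("square", "circle")}


def bind_additive(color, shape):
    """Object representation as a sum of concept vectors (bag-of-concepts)."""
    return colors[color] + shapes[shape]


def bind_multiplicative(color, shape):
    """Object representation as an elementwise product of concept vectors."""
    return colors[color] * shapes[shape]


def scene_embedding(objects, bind):
    """Scene embedding as the sum of its object representations."""
    return sum(bind(color, shape) for color, shape in objects)


# Two scenes with the same concepts but swapped color-shape pairings.
scene_a = [("red", "square"), ("blue", "circle")]
scene_b = [("blue", "square"), ("red", "circle")]

additive_gap = np.linalg.norm(
    scene_embedding(scene_a, bind_additive) - scene_embedding(scene_b, bind_additive)
)
multiplicative_gap = np.linalg.norm(
    scene_embedding(scene_a, bind_multiplicative)
    - scene_embedding(scene_b, bind_multiplicative)
)

print("additive binding gap:", additive_gap)
print("multiplicative binding gap:", multiplicative_gap)
```

Under additive binding the two scenes collapse to the same embedding up to floating-point error, so the pairing of colors and shapes is lost. Under the multiplicative variant the difference equals the elementwise product of (red - blue) and (square - circle), which is generically nonzero, so the two scenes remain distinguishable.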