Relational Visual Similarity
December 8, 2025
Authors: Thao Nguyen, Sicheng Mo, Krishna Kumar Singh, Yilin Wang, Jing Shi, Nicholas Kolkin, Eli Shechtman, Yong Jae Lee, Yuheng Li
cs.AI
Abstract
Humans do not just see attribute similarity; we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. This ability to perceive and recognize relational similarity is argued by cognitive scientists to be what distinguishes humans from other species. Yet all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when the internal relations or functions among their visual elements correspond, even if their visual attributes differ. We then curate a dataset of 114k image-caption pairs in which the captions are anonymized, describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a vision-language model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has many real-world applications, existing image similarity models fail to capture it, revealing a critical gap in visual computing.
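
To make the measurement concrete, below is a minimal sketch of how pairwise image similarity is typically scored with a vision-language encoder. It uses an off-the-shelf CLIP checkpoint purely as a stand-in (the abstract itself notes that CLIP fails to capture relational similarity); the model ID and image file names are illustrative assumptions, not artifacts released with the paper. A relationally finetuned encoder trained on the 114k anonymized-caption dataset would slot into the same interface.

```python
# Minimal sketch (not the authors' code): scoring pairwise image similarity
# with a vision-language encoder. The checkpoint below is a stand-in; the
# paper finetunes its own model on a 114k relational-caption dataset.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # stand-in, not the paper's model

model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def embed(image: Image.Image) -> torch.Tensor:
    """Encode an image into a unit-norm embedding vector."""
    inputs = processor(images=image, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    """Cosine similarity between the two images' embeddings."""
    return float(embed(img_a) @ embed(img_b).T)

# Off-the-shelf CLIP scores attribute-similar pairs (apple vs. peach) highly
# but misses relational pairs (Earth cross-section vs. peach cross-section);
# a relationally finetuned encoder should also score the latter pair highly.
# File names are hypothetical placeholders.
earth = Image.open("earth_cross_section.jpg")
peach = Image.open("peach_cross_section.jpg")
print(f"similarity: {similarity(earth, peach):.3f}")
```

Under this framing, the paper's contribution can be read as changing what the embedding space encodes, rather than how the score is computed: the cosine-similarity interface stays the same, but finetuning on anonymized captions pulls images with matching relational logic together even when their visual attributes differ.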