See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
May 18, 2026
Authors: Boyuan Sun, Bowen Yin, Yuanming Li, Xihan Wei, Qibin Hou
cs.AI
Abstract
We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM uses mask supervision only during training to guide cross-modal attention, so that at inference the model automatically attends to the user-specified object. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and outperforms visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at https://github.com/HumanMLLM/SWIM.
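The abstract does not give the form of the spatial-consistency objective, so the following is only an illustrative sketch of the general idea, not the authors' loss. It assumes that each layer yields a non-negative attention map from an object-noun token over an H x W grid of visual patches, that the maps are averaged across layers, and that the penalty is the negative log of the attention mass falling inside the ground-truth mask. The names `average_layers` and `mask_alignment_loss` are hypothetical.

```python
import math
from typing import List, Sequence

Grid = Sequence[Sequence[float]]


def average_layers(attn_maps: Sequence[Grid]) -> List[List[float]]:
    """Average per-layer attention maps that all share the same H x W shape."""
    n = len(attn_maps)
    h, w = len(attn_maps[0]), len(attn_maps[0][0])
    return [[sum(m[i][j] for m in attn_maps) / n for j in range(w)]
            for i in range(h)]


def mask_alignment_loss(attn_maps: Sequence[Grid], mask: Grid,
                        eps: float = 1e-8) -> float:
    """Assumed loss: -log of the fraction of attention mass inside the mask.

    attn_maps: one non-negative H x W map per layer (object-noun token
               attending to visual patches).
    mask:      binary H x W ground-truth object mask (1 = object patch).
    """
    avg = average_layers(attn_maps)
    total = sum(sum(row) for row in avg)
    inside = sum(a * m
                 for a_row, m_row in zip(avg, mask)
                 for a, m in zip(a_row, m_row))
    return -math.log((inside + eps) / (total + eps))


# Toy 2 x 2 patch grid with the object in the top-left patch.
mask = [[1, 0], [0, 0]]
focused = [[[1.0, 0.0], [0.0, 0.0]]] * 2      # two layers, all mass on the object
diffuse = [[[0.25, 0.25], [0.25, 0.25]]] * 2  # two layers, mass spread evenly

loss_focused = mask_alignment_loss(focused, mask)
loss_diffuse = mask_alignment_loss(diffuse, mask)
```

Under these assumptions, the diffuse pattern that the abstract attributes to object nouns incurs a higher loss than a map concentrated on the masked object, which is the behavior a training-time alignment term of this kind is meant to reward.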