
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

May 18, 2026
Authors: Boyuan Sun, Bowen Yin, Yuanming Li, Xihan Wei, Qibin Hou
cs.AI

Abstract

We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding from textual prompts alone. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM uses mask supervision only during training to guide cross-modal attention, so that the model automatically attends to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at https://github.com/HumanMLLM/SWIM.
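To make the idea of enforcing spatial consistency between object-noun attention and ground-truth masks more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state the loss, so the functions `average_layers`, `normalize`, and `alignment_loss`, the choice to average layers uniformly, and the "one minus attention mass inside the mask" objective are all illustrative assumptions.

```python
from typing import List

Grid = List[List[float]]  # an H x W map stored as nested lists


def average_layers(attn_layers: List[Grid]) -> Grid:
    """Average per-layer attention maps (all H x W) into a single map.

    Assumption: layers are weighted uniformly; the paper may combine them differently.
    """
    n = len(attn_layers)
    h, w = len(attn_layers[0]), len(attn_layers[0][0])
    return [
        [sum(layer[i][j] for layer in attn_layers) / n for j in range(w)]
        for i in range(h)
    ]


def normalize(attn: Grid, eps: float = 1e-8) -> Grid:
    """Scale a non-negative map so its entries sum to (approximately) 1."""
    total = sum(sum(row) for row in attn) + eps
    return [[v / total for v in row] for row in attn]


def alignment_loss(attn_layers: List[Grid], mask: Grid) -> float:
    """Hypothetical alignment loss: 1 minus the attention mass inside the mask.

    attn_layers: attention of one object-noun token over visual patches, per layer.
    mask: binary ground-truth object mask at the same H x W resolution.
    """
    attn = normalize(average_layers(attn_layers))
    inside = sum(
        a * m
        for attn_row, mask_row in zip(attn, mask)
        for a, m in zip(attn_row, mask_row)
    )
    return 1.0 - inside


# Toy 2 x 2 example: the object occupies the top-left patch.
mask = [[1.0, 0.0], [0.0, 0.0]]
diffuse = [[[0.25, 0.25], [0.25, 0.25]]] * 2  # scattered attention in two layers
sharp = [[[1.0, 0.0], [0.0, 0.0]]] * 2        # attention focused on the object

diffuse_loss = alignment_loss(diffuse, mask)
sharp_loss = alignment_loss(sharp, mask)
print(diffuse_loss > sharp_loss)  # the focused map is penalized less
```

Under this assumed objective, the diffuse pattern that the abstract attributes to object nouns is penalized, while a map concentrated on the masked region is not. Because the mask is only needed to compute the loss, it can be dropped at inference, which matches the training-only supervision described above.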