See What I Mean: Aligning Vision and Language Representations for Fine-Grained Object Understanding in Video

Abstract

We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding from textual prompts alone. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM uses mask supervision only during training to guide cross-modal attention, so that at inference the model automatically attends to the user-specified object. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse, scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and outperforms visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at https://github.com/HumanMLLM/SWIM.
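
The abstract does not state the exact form of the spatial consistency objective, so the following is only a minimal sketch under stated assumptions, not the SWIM implementation. It assumes each layer yields one non-negative 2D cross-attention map for the object-noun token, and it penalizes the negative log of the attention mass that falls inside the ground-truth mask, averaged over layers. The entropy helper illustrates one possible way to quantify the "sharp versus diffuse" contrast described above. All function names (`normalize`, `mass_inside_mask`, `alignment_loss`, `attention_entropy`) are hypothetical.

```python
import math


def normalize(attn):
    """Flatten a 2D attention map and rescale it so its entries sum to 1."""
    flat = [max(v, 0.0) for row in attn for v in row]
    total = sum(flat)
    if total <= 0.0:
        raise ValueError("attention map must contain positive mass")
    return [v / total for v in flat]


def mass_inside_mask(attn, mask):
    """Fraction of normalized attention that lands on mask cells (1 = object)."""
    probs = normalize(attn)
    flat_mask = [m for row in mask for m in row]
    if len(flat_mask) != len(probs):
        raise ValueError("attention map and mask must have the same shape")
    return sum(p for p, m in zip(probs, flat_mask) if m)


def alignment_loss(layer_maps, mask, eps=1e-8):
    """Assumed loss form: mean over layers of -log(attention mass inside mask)."""
    losses = [-math.log(max(mass_inside_mask(a, mask), eps)) for a in layer_maps]
    return sum(losses) / len(losses)


def attention_entropy(attn):
    """Shannon entropy of the normalized map; higher means more diffuse."""
    return -sum(p * math.log(p) for p in normalize(attn) if p > 0.0)


# Toy 3x3 example: the object occupies the lower-right 2x2 block.
mask = [[0, 0, 0],
        [0, 1, 1],
        [0, 1, 1]]
diffuse = [[1.0, 1.0, 1.0] for _ in range(3)]   # uniform, noun-like pattern
sharp = [[0.0, 0.0, 0.0],
         [0.0, 1.0, 1.0],
         [0.0, 1.0, 1.0]]                        # fully inside the mask

print(alignment_loss([diffuse], mask))           # positive penalty
print(alignment_loss([sharp], mask))             # no penalty
print(attention_entropy(diffuse) > attention_entropy(sharp))
```

In this toy setting the uniform map places 4/9 of its mass on the object and is penalized, while the map confined to the mask incurs no penalty. A real training setup would compute the same quantity on differentiable tensors and add it to the language-modeling loss with a weighting that the abstract does not specify.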