See What I Mean: Aligning Vision and Language Representations for Fine-Grained Object Understanding in Video

Abstract

We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM uses mask supervision only during training to guide cross-modal attention, so that at inference the model automatically attends to the object the user specifies. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at https://github.com/HumanMLLM/SWIM.
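
The abstract describes supervising an object noun's cross-attention maps with ground-truth masks but does not give the loss. The following is only an illustrative sketch of one way such a constraint could be written, not the authors' formulation: the function name `attention_mask_loss`, the uniform averaging over layers, and the negative-log "attention mass inside the mask" objective are all assumptions made for this example.

```python
import math


def attention_mask_loss(layer_maps, mask, eps=1e-8):
    """Hypothetical alignment loss (an assumption, not the paper's definition).

    layer_maps: list of H x W grids of non-negative attention weights,
                one grid per layer, for a single object-noun token.
    mask:       H x W grid of 0/1 values marking the referred object.
    Returns -log of the fraction of layer-averaged attention inside the mask.
    """
    n_layers = len(layer_maps)
    height, width = len(mask), len(mask[0])

    # Average the noun token's attention over all selected layers.
    avg = [
        [sum(m[i][j] for m in layer_maps) / n_layers for j in range(width)]
        for i in range(height)
    ]

    total = sum(sum(row) for row in avg)
    inside = sum(
        avg[i][j] for i in range(height) for j in range(width) if mask[i][j]
    )

    # Low when attention concentrates on the object, high when it is diffuse.
    return -math.log(inside / (total + eps) + eps)


mask = [[1, 0], [0, 0]]
focused = [[[1.0, 0.0], [0.0, 0.0]]]      # all attention on the object
diffuse = [[[0.25, 0.25], [0.25, 0.25]]]  # attention spread evenly

print(attention_mask_loss(focused, mask) < attention_mask_loss(diffuse, mask))
```

Under these assumptions, a diffuse map of the kind the analysis attributes to object nouns is penalized more heavily than a map concentrated on the masked region, which is the direction of pressure the abstract describes.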