

See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

December 26, 2025
Authors: Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Yujiu Yang, Rui Wang
cs.AI

Abstract

Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
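
To make the two shaping terms described in the abstract more concrete, the sketch below is a minimal, hypothetical PyTorch-style illustration. It assumes the constraints act on the model's answer distributions for the original, evidence-preserving, and evidence-ablated views, and that "separation" is realized by negating the KL term so it is pushed apart during optimization; the function name, tensor conventions, and loss weighting are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def bips_shaping_losses(logits_orig: torch.Tensor,
                        logits_preserved: torch.Tensor,
                        logits_ablated: torch.Tensor):
    """Hypothetical sketch of the two BiPS shaping terms.

    logits_orig:      model logits given the original image
    logits_preserved: logits given the evidence-preserving view
                      (only question-relevant regions kept)
    logits_ablated:   logits given the evidence-ablated view
                      (critical pixels masked out)
    """
    log_p_orig = F.log_softmax(logits_orig, dim=-1)
    p_pres = F.softmax(logits_preserved, dim=-1)
    p_abl = F.softmax(logits_ablated, dim=-1)

    # KL-consistency: predictions on the original and evidence-preserving
    # views should agree, encouraging coarse but complete coverage of the
    # supporting pixels.
    kl_consistency = F.kl_div(log_p_orig, p_pres, reduction="batchmean")

    # KL-separation: predictions on the original and evidence-ablated views
    # should diverge, so the model cannot answer from text alone.
    # Negating the KL is one simple (assumed) way to maximize it.
    kl_separation = -F.kl_div(log_p_orig, p_abl, reduction="batchmean")

    return kl_consistency, kl_separation
```

In an actual training loop, the two terms would presumably be weighted and added to the standard answer-prediction loss; the exact coefficients and KL direction used by BiPS are not specified in the abstract.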