See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

December 26, 2025
作者: Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Yujiu Yang, Rui Wang
cs.AI

Abstract

Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
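The abstract describes two training-time constraints but not their exact form. Below is a minimal sketch of how the two shaping terms might be computed, assuming a VLM callable `model(image, question)` that returns answer-token logits; the function names, the `margin` hinge on the separation term, and the signature of `model` are all illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) between two answer-token distributions."""
    p_log = F.log_softmax(p_logits, dim=-1)
    q_log = F.log_softmax(q_logits, dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input)
    return F.kl_div(q_log, p_log, log_target=True, reduction="batchmean")

def bips_losses(model, question, image, preserved_view, ablated_view, margin=1.0):
    """Illustrative BiPS shaping terms.

    preserved_view: image with only question-relevant regions kept
    ablated_view:   image with the critical evidence pixels masked out
    """
    logits_full = model(image, question)           # original image
    logits_keep = model(preserved_view, question)  # evidence-preserving view
    logits_abl = model(ablated_view, question)     # evidence-ablated view

    # KL-consistency: predictions on the preserved view should match
    # those on the full image (coarse but complete coverage of
    # supporting pixels).
    loss_consistency = kl_divergence(logits_full, logits_keep)

    # KL-separation: predictions on the ablated view should diverge
    # from those on the full image, so the model cannot answer from
    # text alone. A hinge (an assumption here) keeps the term bounded.
    loss_separation = F.relu(margin - kl_divergence(logits_full, logits_abl))

    return loss_consistency, loss_separation
```

In this reading, the consistency term pulls attention toward the question-relevant regions, while the separation term penalizes any answer distribution that survives removal of the visual evidence, which is what the abstract means by discouraging text-only shortcuts.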