
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation

May 24, 2025
作者: Jiwan Chung, Junhyeok Kim, Siyeol Kim, Jaeyoung Lee, Min Soo Kim, Youngjae Yu
cs.AI

Abstract

We present v1, a lightweight extension to Multimodal Large Language Models (MLLMs) that enables selective visual revisitation during inference. While current MLLMs typically consume visual input only once and reason purely over internal memory, v1 introduces a simple point-and-copy mechanism that allows the model to dynamically retrieve relevant image regions throughout the reasoning process. This mechanism augments existing architectures with minimal modifications, enabling contextual access to visual tokens based on the model's evolving hypotheses. To train this capability, we construct v1g, a dataset of 300K multimodal reasoning traces with interleaved visual grounding annotations. Experiments on three multimodal mathematical reasoning benchmarks -- MathVista, MathVision, and MathVerse -- demonstrate that v1 consistently improves performance over comparable baselines, particularly on tasks requiring fine-grained visual reference and multi-step reasoning. Our results suggest that dynamic visual access is a promising direction for enhancing grounded multimodal reasoning. Code, models, and data will be released to support future research.
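The abstract describes the point-and-copy mechanism only in words. Below is a minimal, speculative sketch of how such a decoding loop could look, assuming a HuggingFace-style causal multimodal LM that accepts `inputs_embeds` and exposes `logits` and `hidden_states`. The function name, the special `point_token_id`, the cosine-similarity pointer, and `top_k` are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def point_and_copy_decode(model, embed_tokens, visual_embeds, prompt_embeds,
                          point_token_id, eos_token_id,
                          top_k=4, max_new_tokens=256):
    """Illustrative decoding loop with selective visual revisitation.

    visual_embeds: (num_patches, d) image-patch embeddings from the vision encoder.
    prompt_embeds: (1, prompt_len, d) embedded prompt (text + initial visual tokens).
    """
    context = prompt_embeds
    generated = []
    for _ in range(max_new_tokens):
        out = model(inputs_embeds=context, output_hidden_states=True)
        h_last = out.hidden_states[-1][:, -1, :]        # (1, d) current hidden state
        next_id = out.logits[:, -1, :].argmax(dim=-1)   # greedy decoding for brevity
        generated.append(int(next_id))
        # Append the embedding of the newly generated token to the running context.
        context = torch.cat([context, embed_tokens(next_id).unsqueeze(1)], dim=1)

        if int(next_id) == point_token_id:
            # "Point": score every image patch against the current hidden state
            # (a stand-in for whatever learned pointer the model actually uses).
            scores = F.cosine_similarity(h_last, visual_embeds, dim=-1)  # (num_patches,)
            top = scores.topk(top_k).indices
            # "Copy": re-insert the selected patch embeddings so later reasoning
            # steps can attend to them directly instead of relying on memory alone.
            context = torch.cat([context, visual_embeds[top].unsqueeze(0)], dim=1)

        if int(next_id) == eos_token_id:
            break
    return generated
```

In the paper's method, this pointing behavior is trained on the v1g traces with interleaved grounding annotations; the greedy decoding and cosine-similarity scoring above merely stand in for that learned component.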
