Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation
May 24, 2025
作者: Jiwan Chung, Junhyeok Kim, Siyeol Kim, Jaeyoung Lee, Min Soo Kim, Youngjae Yu
cs.AI
Abstract
We present v1, a lightweight extension to Multimodal Large Language Models
(MLLMs) that enables selective visual revisitation during inference. While
current MLLMs typically consume visual input only once and reason purely over
internal memory, v1 introduces a simple point-and-copy mechanism that allows
the model to dynamically retrieve relevant image regions throughout the
reasoning process. This mechanism augments existing architectures with minimal
modifications, enabling contextual access to visual tokens based on the model's
evolving hypotheses. To train this capability, we construct v1g, a dataset of
300K multimodal reasoning traces with interleaved visual grounding annotations.
Experiments on three multimodal mathematical reasoning benchmarks -- MathVista,
MathVision, and MathVerse -- demonstrate that v1 consistently improves
performance over comparable baselines, particularly on tasks requiring
fine-grained visual reference and multi-step reasoning. Our results suggest
that dynamic visual access is a promising direction for enhancing grounded
multimodal reasoning. Code, models, and data will be released to support future
research.
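
To make the point-and-copy idea concrete, below is a minimal sketch (in PyTorch) of how a decoder hidden state could score cached image-patch embeddings and copy the top-scoring patches back into the reasoning context for later steps to attend to. The class name `PointAndCopySketch`, the dot-product scoring, and the fixed top-k selection are illustrative assumptions for exposition, not the authors' actual implementation.

```python
# Minimal sketch of a "point-and-copy" step: the current decoder state
# acts as a pointer query over visual tokens cached at encoding time,
# and the selected patch embeddings are copied back into the context.
# All names here are hypothetical, not the v1 codebase's API.

import torch
import torch.nn as nn


class PointAndCopySketch(nn.Module):
    def __init__(self, d_model: int, top_k: int = 4):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)  # decoder state -> pointer query
        self.key_proj = nn.Linear(d_model, d_model)    # cached visual tokens -> pointer keys
        self.top_k = top_k

    def forward(self, decoder_state: torch.Tensor, visual_tokens: torch.Tensor):
        """
        decoder_state: (batch, d_model)             hidden state at the current reasoning step
        visual_tokens: (batch, n_patches, d_model)  image-patch embeddings cached after encoding
        returns: copied patches (batch, top_k, d_model) and their indices (batch, top_k)
        """
        q = self.query_proj(decoder_state).unsqueeze(1)        # (batch, 1, d_model)
        k = self.key_proj(visual_tokens)                       # (batch, n_patches, d_model)
        scores = (q * k).sum(-1) / k.shape[-1] ** 0.5          # (batch, n_patches) pointer logits
        top_idx = scores.topk(self.top_k, dim=-1).indices      # patches the model "points at"
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, visual_tokens.shape[-1])
        copied = visual_tokens.gather(1, gather_idx)           # (batch, top_k, d_model)
        return copied, top_idx


if __name__ == "__main__":
    # Toy usage: one decoding step re-selects 4 of 196 cached patch embeddings.
    batch, n_patches, d_model = 2, 196, 64
    pointer = PointAndCopySketch(d_model, top_k=4)
    state = torch.randn(batch, d_model)
    patches = torch.randn(batch, n_patches, d_model)
    copied, idx = pointer(state, patches)
    print(copied.shape, idx.shape)  # torch.Size([2, 4, 64]) torch.Size([2, 4])
```

In a full model, the copied patch embeddings would be appended to the decoding context (e.g., as extra key/value entries), which is what lets the hypotheses formed mid-reasoning condition on freshly revisited visual evidence rather than on the initial encoding alone.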