Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation

Abstract

We present v1, a lightweight extension to Multimodal Large Language Models (MLLMs) that enables selective visual revisitation during inference. Current MLLMs typically consume visual input only once and then reason purely over internal memory. In contrast, v1 introduces a simple point-and-copy mechanism that lets the model dynamically retrieve relevant image regions throughout the reasoning process. The mechanism augments existing architectures with minimal modifications and gives the model contextual access to visual tokens based on its evolving hypotheses. To train this capability, we construct v1g, a dataset of 300K multimodal reasoning traces with interleaved visual grounding annotations. Experiments on three multimodal mathematical reasoning benchmarks (MathVista, MathVision, and MathVerse) show that v1 consistently improves performance over comparable baselines, particularly on tasks requiring fine-grained visual reference and multi-step reasoning. Our results suggest that dynamic visual access is a promising direction for enhancing grounded multimodal reasoning. Code, models, and data will be released to support future research.
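The abstract does not say how the point-and-copy mechanism is implemented. As a rough illustration only, the sketch below assumes one plausible reading: at each decoding step the model scores both ordinary vocabulary tokens and the image patch embeddings it was given, and when a patch scores highest, that patch's embedding is copied back in as the next input. The function name `point_and_copy_step`, the dot-product scoring, and the greedy selection are all assumptions made for this example, not details taken from v1.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def point_and_copy_step(hidden, vocab_emb, patch_emb):
    """One greedy decoding step over an extended output space.

    Hypothetical sketch: the output space is the vocabulary plus one
    "pointer" slot per image patch. Scores are dot products between the
    current hidden state and each candidate embedding (an assumption).

    Returns (kind, index, embedding), where kind is "token" or "patch".
    The returned embedding is what would be fed as the next input.
    """
    vocab_logits = [dot(hidden, e) for e in vocab_emb]
    pointer_logits = [dot(hidden, p) for p in patch_emb]
    logits = vocab_logits + pointer_logits

    # Greedy choice over the joint space of tokens and patch pointers.
    best = max(range(len(logits)), key=logits.__getitem__)
    if best < len(vocab_emb):
        return "token", best, vocab_emb[best]

    # A pointer was selected: copy the referenced patch embedding.
    k = best - len(vocab_emb)
    return "patch", k, patch_emb[k]


# Toy data: two vocabulary embeddings and two image patch embeddings.
vocab_emb = [[1.0, 0.0], [0.0, 1.0]]
patch_emb = [[2.0, 2.0], [-1.0, 3.0]]

revisit = point_and_copy_step([1.0, 1.0], vocab_emb, patch_emb)
plain = point_and_copy_step([1.0, -1.0], vocab_emb, patch_emb)
```

With these toy values, the first hidden state aligns best with patch 0, so the step revisits the image; the second aligns best with vocabulary token 0, so decoding continues as ordinary text generation.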
