Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation

Abstract

We present v1, a lightweight extension to Multimodal Large Language Models (MLLMs) that enables selective visual revisitation during inference. While current MLLMs typically consume visual input only once and reason purely over internal memory, v1 introduces a simple point-and-copy mechanism that allows the model to dynamically retrieve relevant image regions throughout the reasoning process. This mechanism augments existing architectures with minimal modifications, enabling contextual access to visual tokens based on the model's evolving hypotheses. To train this capability, we construct v1g, a dataset of 300K multimodal reasoning traces with interleaved visual grounding annotations. Experiments on three multimodal mathematical reasoning benchmarks (MathVista, MathVision, and MathVerse) demonstrate that v1 consistently improves performance over comparable baselines, particularly on tasks requiring fine-grained visual reference and multi-step reasoning. Our results suggest that dynamic visual access is a promising direction for enhancing grounded multimodal reasoning. Code, models, and data will be released to support future research.
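
To make the point-and-copy idea concrete, the following is a minimal sketch of a single decoding step, not the authors' implementation. It assumes that pointing can be modeled as one softmax over vocabulary logits concatenated with dot-product scores against image patch embeddings, and that "copying" means feeding the selected patch embedding back as the next input. The function name `point_and_copy_step` and all of its parameters are hypothetical and chosen only for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def point_and_copy_step(hidden, vocab_embeddings, patch_embeddings):
    """Choose the next input from text tokens and image patches jointly.

    Assumption (not taken from the paper): the decoder state `hidden` is
    scored against every vocabulary embedding and every patch embedding,
    and the highest-probability entry of the joint distribution is taken.

    Returns a tuple (kind, index, embedding) where kind is "token" or "patch".
    """
    vocab_logits = [dot(hidden, e) for e in vocab_embeddings]
    point_logits = [dot(hidden, p) for p in patch_embeddings]

    # One distribution over the vocabulary followed by the patch positions.
    probs = softmax(vocab_logits + point_logits)
    choice = max(range(len(probs)), key=probs.__getitem__)

    if choice < len(vocab_embeddings):
        # Ordinary generation: emit a text token.
        return "token", choice, vocab_embeddings[choice]

    # Pointing: the patch embedding itself is copied back as the next input,
    # so the model revisits that image region while reasoning.
    patch_index = choice - len(vocab_embeddings)
    return "patch", patch_index, patch_embeddings[patch_index]
```

With toy two-dimensional embeddings, a decoder state aligned with a patch selects that patch, and a state aligned with a vocabulary entry selects a text token. A greedy argmax is used here only for brevity; sampling from the joint distribution would fit the same interface.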
