Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

Abstract

Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, which limits performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but heavy image-generation pre-training often hinders their reasoning ability. Inspired by the way humans reason with mental imagery (the internal construction and manipulation of visual cues), we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to "think visually", it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. We first supervise the latent tokens through distillation from ground-truth image embeddings, then switch to text-only supervision so that the latent trajectory aligns tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.
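To make the decoding idea concrete, the following is a minimal toy sketch of interleaving text steps with latent steps, where a latent step feeds the hidden state back as the next input instead of decoding a token. It is not the Mirage implementation: the tiny recurrent cell standing in for a VLM, the names `ToyDecoder` and `decode`, the sizes, and the caller-supplied `plan` (which replaces the model's own choice of when to think visually) are all assumptions made for illustration.

```python
import math
import random

DIM = 4                    # toy hidden size (assumption)
VOCAB = ["a", "b", "c"]    # toy text vocabulary (assumption)


def rand_matrix(rows, cols, rng):
    """Random weight matrix stored as a list of rows."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToyDecoder:
    """Tiny recurrent cell standing in for a VLM decoder (hypothetical)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.embed = rand_matrix(len(VOCAB), DIM, rng)  # token embeddings
        self.w_in = rand_matrix(DIM, DIM, rng)          # input projection
        self.w_h = rand_matrix(DIM, DIM, rng)           # recurrent projection
        self.w_out = rand_matrix(len(VOCAB), DIM, rng)  # text output head

    def step(self, hidden, inp):
        """Consume one input embedding and return the new hidden state."""
        a = matvec(self.w_in, inp)
        b = matvec(self.w_h, hidden)
        return [math.tanh(p + q) for p, q in zip(a, b)]

    def greedy_token(self, hidden):
        """Pick the highest-scoring text token id."""
        logits = matvec(self.w_out, hidden)
        return max(range(len(VOCAB)), key=lambda i: logits[i])


def decode(model, prompt_ids, plan):
    """Follow `plan`, a list of "text" / "latent" step kinds.

    In the setting described above the model itself decides when to think
    visually; the explicit plan is a simplification for this sketch.
    """
    hidden = [0.0] * DIM
    for tok in prompt_ids:
        hidden = model.step(hidden, model.embed[tok])

    trace = []
    for kind in plan:
        if kind == "latent":
            # Latent step: no token is decoded and no image is rendered.
            # The hidden state itself is fed back as the next input.
            latent = list(hidden)
            trace.append(("latent", latent))
            hidden = model.step(hidden, latent)
        else:
            # Text step: decode a token, then feed its embedding back.
            tok = model.greedy_token(hidden)
            trace.append(("text", VOCAB[tok]))
            hidden = model.step(hidden, model.embed[tok])
    return trace


if __name__ == "__main__":
    steps = ["text", "latent", "latent", "text"]
    for kind, value in decode(ToyDecoder(), [0, 1], steps):
        print(kind, value)
```

The returned trace mixes text tokens with continuous vectors, which is the sense in which the trajectory is multimodal without any pixel-level output. The training stages (distillation from image embeddings, text-only supervision, reinforcement learning) are not modeled here.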
