Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

Abstract

Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders their reasoning ability. Inspired by the way humans reason with mental imagery (the internal construction and manipulation of visual cues), we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to "think visually", it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. We begin by supervising the latent tokens through distillation from ground-truth image embeddings, then switch to text-only supervision so that the latent trajectory aligns tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.
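The decoding idea described in the abstract, where a hidden state is fed back as the next input instead of being turned into a text token or an image, can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions and not the authors' implementation: the function `decode_interleaved`, the `<latent>` and `<eos>` markers, the fixed `latent_len`, and the stub model functions are all hypothetical names introduced here for illustration.

```python
from typing import Callable, List, Sequence, Tuple, Union

Vector = List[float]
Step = Tuple[str, Union[str, Vector]]


def decode_interleaved(
    step: Callable[[Sequence[Vector]], Vector],  # inputs so far -> last hidden state
    to_token: Callable[[Vector], str],           # hidden state -> text token
    embed: Callable[[str], Vector],              # text token -> input embedding
    prompt: Sequence[Vector],
    latent_len: int,                             # assumption: fixed latent span length
    max_steps: int,
    start_marker: str = "<latent>",              # hypothetical marker token
    stop_token: str = "<eos>",                   # hypothetical stop token
) -> List[Step]:
    """Decode a trajectory that interleaves text tokens and latent steps."""
    inputs = list(prompt)
    trajectory: List[Step] = []
    latent_left = 0
    for _ in range(max_steps):
        hidden = step(inputs)
        if latent_left > 0:
            # "Visual thinking": reuse the hidden state directly as the next
            # input embedding. No text token and no pixels are produced.
            inputs.append(hidden)
            trajectory.append(("latent", hidden))
            latent_left -= 1
            continue
        # Ordinary text decoding: pick a token, then feed back its embedding.
        token = to_token(hidden)
        trajectory.append(("text", token))
        inputs.append(embed(token))
        if token == start_marker:
            latent_left = latent_len
        if token == stop_token:
            break
    return trajectory


# Stub "model" used only to exercise the loop; it has no learned behavior.
def toy_step(inputs: Sequence[Vector]) -> Vector:
    n = len(inputs)
    return [float(n), float(n % 2)]


SCRIPT = {1: "<latent>", 4: "answer", 5: "<eos>"}


def toy_to_token(hidden: Vector) -> str:
    return SCRIPT.get(int(hidden[0]), "<eos>")


def toy_embed(token: str) -> Vector:
    return [0.0, 0.0]


trajectory = decode_interleaved(
    toy_step, toy_to_token, toy_embed, [[0.0, 0.0]], latent_len=2, max_steps=10
)
print([kind for kind, _ in trajectory])
# → ['text', 'latent', 'latent', 'text', 'text']
```

In this sketch the switch into latent mode is triggered by a marker token and lasts a fixed number of steps. How the actual framework decides when to start and stop "thinking visually" is not specified in the abstract, so both choices here are assumptions.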
