Thinking with Drafting: Optical Decompression via Logical Reconstruction

Abstract

Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing their logical topology, while pixel-based generative models produce visual artifacts that lack mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression, that is, the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which uses a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model as executable code, which is rendered into deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system in which visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
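
The draft-render-verify loop described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the TwD DSL, so everything below is an assumption made for illustration: the two statement forms (`bar NAME LENGTH` and `sum NAME PART PART ...`), the function names `parse_draft`, `render`, and `verify`, and the use of text bars as a stand-in for a rendered diagram are all hypothetical and are not taken from TwD or VisAlg.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Bar:
    """One labelled bar of a bar-model diagram (hypothetical structure)."""
    name: str
    length: int


def parse_draft(draft: str) -> dict[str, Bar]:
    """Parse a toy draft into named bars.

    Hypothetical syntax, one statement per line:
      bar NAME LENGTH         -- a bar with a known length
      sum NAME PART PART ...  -- a bar built from previously defined bars
    """
    bars: dict[str, Bar] = {}
    for raw in draft.strip().splitlines():
        tokens = raw.split()
        if not tokens:
            continue  # skip blank lines inside the draft
        if len(tokens) < 3:
            raise ValueError(f"incomplete statement: {raw!r}")
        op, name, args = tokens[0], tokens[1], tokens[2:]
        if op == "bar":
            bars[name] = Bar(name, int(args[0]))
        elif op == "sum":
            # Every part must already be defined; a KeyError here means
            # the draft is structurally inconsistent.
            bars[name] = Bar(name, sum(bars[part].length for part in args))
        else:
            raise ValueError(f"unknown statement: {raw!r}")
    return bars


def render(bars: dict[str, Bar]) -> str:
    """Render each bar as a row of '#' characters; same draft, same output."""
    width = max(len(name) for name in bars)
    return "\n".join(
        f"{bar.name.ljust(width)} |{'#' * bar.length}| {bar.length}"
        for bar in bars.values()
    )


def verify(bars: dict[str, Bar], target: str, claimed: int) -> bool:
    """Check a claimed answer against the structure the draft encodes."""
    return bars[target].length == claimed
```

Parsing the draft `bar a 3`, `bar b 5`, `sum total a b` and calling `verify(bars, "total", 8)` accepts the claimed answer, while a claim of 9 is rejected. The point of the sketch is only that the answer is checked against an executable structure rather than generated directly.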
