Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Abstract

Chain-of-Thought (CoT) reasoning improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work moves from text-based multimodal reasoning toward interleaved-modal reasoning, in which intermediate steps can incorporate both textual rationales and visual evidence. In this work, we explore a bolder idea: can images alone serve as the reasoning medium for both language and multimodal tasks? To investigate this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate the concept in two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can encode rationales effectively and efficiently while providing a unified visual canvas for reasoning.