Visual Reasoning: Reimagining Images as an Expressive Reasoning Medium Beyond Text

Abstract

Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work moves from text-based multimodal reasoning toward interleaved-modal reasoning, in which intermediate steps can incorporate both textual rationales and visual evidence. In this work, we pose a bolder question: can images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can encode rationales effectively and efficiently while providing a unified visual canvas for reasoning.
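
To make the reported reduction percentages concrete, the following is a minimal sketch of how a relative token reduction could be computed. The function name `token_reduction` and the token counts are hypothetical and chosen only for illustration; they are not measurements from this work, and the sketch covers only the reduction percentage, not the token-efficiency ratio.

```python
def token_reduction(text_tokens: int, optical_tokens: int) -> float:
    """Return the percentage of reasoning tokens saved relative to text reasoning.

    Both arguments are hypothetical token counts for the same rationale,
    once expressed as text and once rendered as an image.
    """
    if text_tokens <= 0:
        raise ValueError("text_tokens must be positive")
    return 100.0 * (text_tokens - optical_tokens) / text_tokens


if __name__ == "__main__":
    # Illustrative counts only: 700 text tokens versus 500 image tokens
    # corresponds to a reduction of 2/7, or about 28.57%.
    print(round(token_reduction(700, 500), 2))
    # Illustrative counts only: 100 text tokens versus 84 image tokens
    # corresponds to a 16% reduction.
    print(round(token_reduction(100, 84), 2))
```

Under this reading, a 28.57% reduction means the image-based rationale uses roughly five tokens for every seven used by the text rationale.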