Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Abstract

Chain-of-Thought (CoT) reasoning improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). Recent work moves further, from text-based multimodal reasoning to interleaved-modal reasoning, in which intermediate steps can include both textual rationales and visual evidence. In this work, we ask a more ambitious question: can images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate the concept in two variants: typographic-based optical reasoning, which optimizes the visual layout to render rationales compactly, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning matches or even exceeds traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and by 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results indicate that images can encode rationales both effectively and efficiently while providing a unified visual canvas for reasoning.