Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text
June 8, 2026
Authors: Yutong Bian, Dongjie Cheng, Heming Xia, Yongqi Li, Wenjie Li
cs.AI
Abstract
Chain-of-Thought (CoT) reasoning improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we pursue a bolder idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.
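To make the reported token reduction concrete, the short sketch below computes a relative reduction in reasoning tokens. It assumes the reduction is measured as the relative drop from the text-reasoning token count to the optical-reasoning token count, which the abstract does not state explicitly. The function name `token_reduction_pct` and the counts 700 and 500 are hypothetical and chosen only for illustration; they are not taken from the paper. The sketch does not attempt to reproduce the 1.96 times token-efficiency figure, whose definition is not given in the abstract.

```python
def token_reduction_pct(text_tokens: int, optical_tokens: int) -> float:
    """Relative reduction in reasoning tokens, in percent.

    Assumption (not from the paper): reduction is defined as
    (text_tokens - optical_tokens) / text_tokens * 100.
    """
    if text_tokens <= 0:
        raise ValueError("text_tokens must be positive")
    if optical_tokens < 0:
        raise ValueError("optical_tokens must be non-negative")
    return round((text_tokens - optical_tokens) / text_tokens * 100, 2)


# Hypothetical counts: a 700-token text rationale versus a rendered
# rationale that costs 500 tokens corresponds to a 2/7 reduction.
example_reduction = token_reduction_pct(700, 500)
print(f"{example_reduction}% fewer reasoning tokens")
```

Under this assumed definition, a 28.57% reduction corresponds to using five sevenths of the original reasoning tokens, and a 16% reduction to using 84% of them.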