Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text
June 8, 2026
Authors: Yutong Bian, Dongjie Cheng, Heming Xia, Yongqi Li, Wenjie Li
cs.AI
Abstract
Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.