Thinking with Drafting: Optical Decompression via Logical Reconstruction
February 12, 2026
Authors: Jingxuan Wei, Honghao He, Caijun Jia, Siyuan Li, Zheng Sun, Yuhang Xu, Yuanyuan Lin, Linzhuang Sun, Yuchen Wu, Bihui Yu, Xiangxiang Zhang, Cheng Tan
cs.AI
Abstract
Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts that lack mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression: the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which uses a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model as executable code, which is rendered into deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system in which visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
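The draft, render, verify loop described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not specify the TwD DSL, so the toy grammar (`bar <label> <length>` and `total <length>`), the `Draft` type, and the functions `parse`, `render`, and `verify` are hypothetical names, and the text rendering stands in for whatever renderer the authors actually use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Draft:
    """Hypothetical logical structure recovered from a drafted program."""
    bars: dict   # label -> integer length
    total: int   # the sum the draft claims


def parse(source: str) -> Draft:
    """Parse the toy DSL: 'bar <label> <int>' lines plus one 'total <int>' line."""
    bars, total = {}, None
    for raw in source.strip().splitlines():
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] == "bar" and len(parts) == 3:
            bars[parts[1]] = int(parts[2])
        elif parts[0] == "total" and len(parts) == 2:
            total = int(parts[1])
        else:
            raise ValueError(f"unrecognized statement: {raw!r}")
    if total is None:
        raise ValueError("draft has no 'total' statement")
    return Draft(bars, total)


def render(draft: Draft) -> str:
    """Render deterministically: the same draft always yields the same picture."""
    lines = [f"{label}: " + "#" * length for label, length in draft.bars.items()]
    lines.append("total: " + "#" * draft.total)
    return "\n".join(lines)


def verify(draft: Draft) -> bool:
    """Check the claimed total against the drafted bars."""
    return sum(draft.bars.values()) == draft.total


if __name__ == "__main__":
    draft = parse("bar A 3\nbar B 5\ntotal 8")
    print(render(draft))
    print("consistent:", verify(draft))
```

Because the rendering is a pure function of the parsed draft, an inconsistent draft (for example `bar A 3` with `total 9`) fails the check and also produces a visibly mismatched picture, which is the sense in which generation acts as a verifier rather than a creative output.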