Thinking with Drafting: Optical Decompression via Logical Reconstruction
February 12, 2026
Authors: Jingxuan Wei, Honghao He, Caijun Jia, Siyuan Li, Zheng Sun, Yuhang Xu, Yuanyuan Lin, Linzhuang Sun, Yuchen Wu, Bihui Yu, Xiangxiang Zhang, Cheng Tan
cs.AI
Abstract
Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression: the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
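The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify TwD's DSL, so the one-statement grammar (`seg <label> <length>`), the `Segment` type, and the `render_draft` and `verify` helpers are hypothetical and are not the authors' implementation. The sketch only shows the general idea that a draft written in a small DSL can be executed into a deterministic layout, which is then checked against a claimed answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One bar segment in the rendered layout (hypothetical type)."""
    label: str
    start: float
    length: float


def render_draft(draft: str) -> list[Segment]:
    """Execute a toy draft into a deterministic left-to-right layout.

    Assumed grammar (not from the paper): one 'seg <label> <length>' per line.
    """
    segments: list[Segment] = []
    cursor = 0.0
    for raw in draft.strip().splitlines():
        parts = raw.split()
        if len(parts) != 3 or parts[0] != "seg":
            raise ValueError(f"unparseable draft line: {raw!r}")
        length = float(parts[2])
        if length <= 0:
            raise ValueError(f"segment length must be positive: {raw!r}")
        segments.append(Segment(parts[1], cursor, length))
        cursor += length  # the next segment starts where this one ends
    return segments


def verify(draft: str, claimed_total: float, tol: float = 1e-9) -> bool:
    """Check a claimed total against the layout produced by the draft."""
    segments = render_draft(draft)
    total = segments[-1].start + segments[-1].length if segments else 0.0
    return abs(total - claimed_total) <= tol
```

In this toy setting, a draft that fails to parse or whose rendered layout disagrees with the claimed answer is rejected, which loosely mirrors the role the abstract assigns to visual generation as a logical verifier rather than a creative output.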