Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning
January 21, 2026
Authors: Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, Zheng Wei
cs.AI
Abstract
Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs), but its verbosity imposes substantial computational overhead. Recent works on latent reasoning reduce this cost, yet they often focus exclusively on outcome alignment and lack supervision of the intermediate reasoning process, leaving the latent reasoning chain opaque and hard to analyze. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual reasoning steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors that align the visual embeddings with the textual space. This design enables plug-and-play implementation without additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT, while remaining competitive with other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT.
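
To make the rendering-and-encoding idea concrete, the sketch below rasterizes one textual reasoning step into an image and embeds it with a frozen VLM vision encoder, whose patch embeddings could then stand in for the much longer sequence of text tokens. This is a minimal illustration, not the authors' implementation: the Pillow rendering details, the `openai/clip-vit-base-patch16` checkpoint, and the helper name `render_step` are all assumptions for the sake of the example.

```python
# Minimal sketch (illustrative, not the RoT implementation): render a
# chain-of-thought step to an image, then encode it with a CLIP vision
# encoder from Hugging Face transformers.
import torch
from PIL import Image, ImageDraw
from transformers import CLIPImageProcessor, CLIPVisionModel

def render_step(text: str, size=(224, 224)) -> Image.Image:
    """Rasterize one reasoning step onto a white canvas (default bitmap font)."""
    img = Image.new("RGB", size, "white")
    ImageDraw.Draw(img).multiline_text((8, 8), text, fill="black")
    return img

# Load a frozen vision encoder; in RoT's framing it acts as a semantic
# anchor whose output embeddings are aligned with the textual space.
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")

step = "Step 1: 12 * 7 = 84, so the total cost is 84 dollars."
inputs = processor(images=render_step(step), return_tensors="pt")
with torch.no_grad():
    visual_embeds = encoder(**inputs).last_hidden_state  # (1, num_patches + 1, dim)
print(visual_embeds.shape)  # e.g. torch.Size([1, 197, 768])
```

A few hundred text tokens rendered onto one 224x224 image yield a fixed 197-embedding sequence here, which is the intuition behind the 3-4x token compression the abstract reports; how those visual embeddings are consumed by the LLM is specific to RoT and not shown.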