Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning
January 21, 2026
Authors: Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, Zheng Wei
cs.AI
Abstract
Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs), but its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision of the intermediate reasoning process, which obscures the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps as images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures a plug-and-play implementation without additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT, while maintaining competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT.
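The core rendering step described above (textual reasoning steps turned into images that a VLM's vision encoder can consume) can be sketched minimally. This is a hypothetical illustration using Pillow, not the authors' implementation: the canvas size, font, and layout here are assumptions, and a real pipeline would match the target vision encoder's expected input resolution.

```python
from PIL import Image, ImageDraw

def render_step(text: str, size=(448, 448)) -> Image.Image:
    """Render one textual reasoning step onto a white canvas.

    Hypothetical sketch: the actual RoT rendering parameters
    (font, resolution, line wrapping) are not specified in the abstract.
    """
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    # Default bitmap font; a production renderer would wrap long
    # lines and pick a resolution matching the vision encoder.
    draw.multiline_text((10, 10), text, fill="black")
    return img

# One rendered step replaces many explicit CoT tokens: the image
# would be passed through the VLM's vision encoder, whose embeddings
# stand in for the textual rationale in the latent space.
step_img = render_step("Step 1: 12 * 7 = 84, so the total is 84 apples.")
```

Because one image patch sequence can encode an entire step, this rendering direction is what enables the 3-4x token compression the abstract reports.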