Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

Abstract

Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent work often focuses exclusively on outcome alignment and lacks supervision of the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT

