Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

Abstract

Chain-of-Thought (CoT) prompting has been remarkably successful at unlocking the reasoning capabilities of Large Language Models (LLMs), but its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision of the intermediate reasoning process, which makes the latent reasoning chain hard to analyze. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design allows plug-and-play implementation without additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT
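
To make the token-compression claim concrete, here is a minimal Python sketch of the arithmetic. It assumes a ViT-style vision encoder that splits the rendered image into non-overlapping square patches, one visual token per patch. The function names, image size, patch size, and token counts are all hypothetical illustrations; they are not taken from the RoT code or its reported results.

```python
import math


def visual_token_count(width_px: int, height_px: int, patch_px: int) -> int:
    """Number of patch tokens for an image, assuming one token per square patch.

    Assumption: a ViT-style encoder with non-overlapping patches and no
    patch merging; real VLM encoders may differ.
    """
    if min(width_px, height_px, patch_px) <= 0:
        raise ValueError("all dimensions must be positive")
    return math.ceil(width_px / patch_px) * math.ceil(height_px / patch_px)


def compression_ratio(text_tokens: int, visual_tokens: int) -> float:
    """How many text tokens each visual token replaces."""
    if visual_tokens <= 0:
        raise ValueError("visual_tokens must be positive")
    return text_tokens / visual_tokens


# Illustrative numbers only (hypothetical, not results from the paper):
# a reasoning trace of 224 text tokens rendered as a 448x28 pixel strip.
text_tokens = 224
visual_tokens = visual_token_count(width_px=448, height_px=28, patch_px=14)
ratio = compression_ratio(text_tokens, visual_tokens)
print(f"{visual_tokens} visual tokens, {ratio:.2f}x compression")
```

Under these assumed numbers the strip yields 64 patch tokens, so each visual token stands in for 3.5 text tokens, which falls in the 3-4x range the abstract reports. The actual ratio depends on the rendering layout and on the encoder's real patching scheme.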
