
VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning

January 29, 2026
Authors: Yibo Wang, Yongcheng Jing, Shunyu Liu, Hao Guan, Rong-cheng Tu, Chengyu Wang, Jun Huang, Dacheng Tao
cs.AI

Abstract

Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to computational complexity. Existing efficient approaches often rely on complex additional training or on external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models (VLMs) as "optical memory." We construct a training dataset based on OpenR1-Math-220K, achieving 3.4x token compression, and fine-tune representative VLMs, Glyph and Qwen3-VL. Extensive experiments on benchmarks such as MATH500, AIME25, AMC23, and GPQA-D demonstrate that VTC-R1 consistently outperforms standard long-context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving a 2.7x speedup in end-to-end latency, highlighting its potential as a scalable solution for reasoning-intensive applications. Our code is available at https://github.com/w-yibo/VTC-R1.
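To make the abstract's core idea concrete, below is a minimal sketch, not the authors' implementation, of rendering an intermediate reasoning segment into a compact image and looping it back as "optical memory." The text rendering uses Pillow; `vlm_generate` and the `\boxed` stopping heuristic are assumptions introduced here for illustration, since the paper's actual model interface and rendering settings are not given in this abstract.

```python
# Hypothetical sketch of the VTC-R1 loop described in the abstract:
# compress each reasoning segment from text into an image, then feed the
# accumulated images back to a vision-language model instead of raw tokens.
import textwrap
from PIL import Image, ImageDraw, ImageFont


def render_segment(segment: str, width: int = 768, chars_per_line: int = 90) -> Image.Image:
    """Render one reasoning segment as a dense text image (the compression step)."""
    wrapped = textwrap.fill(segment, width=chars_per_line)
    font = ImageFont.load_default()
    line_height = 14  # rough height per line for the default bitmap font
    height = line_height * (wrapped.count("\n") + 2)
    img = Image.new("RGB", (width, height), "white")
    ImageDraw.Draw(img).multiline_text((8, 4), wrapped, fill="black", font=font)
    return img


def vtc_reasoning_loop(question: str, vlm_generate, max_rounds: int = 4) -> str:
    """Iterative reasoning with optical memory.

    `vlm_generate(question, images)` is a placeholder for any VLM call that
    accepts a text prompt plus a list of images and returns the next reasoning
    segment; it is not a real library API.
    """
    optical_memory: list[Image.Image] = []  # earlier segments, stored as images
    answer = ""
    for _ in range(max_rounds):
        segment = vlm_generate(question, optical_memory)  # produce next segment
        optical_memory.append(render_segment(segment))    # compress text -> image
        answer = segment
        if "\\boxed" in segment:  # assumed convention for a final math answer
            break
    return answer
```

In this sketch the token savings come from replacing each textual segment with a single rendered image in the model's context; the abstract reports roughly 3.4x token compression with this style of rendering on OpenR1-Math-220K, though the exact layout (font, resolution, segment length) used there is not specified here.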