VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning
January 29, 2026
Authors: Yibo Wang, Yongcheng Jing, Shunyu Liu, Hao Guan, Rong-cheng Tu, Chengyu Wang, Jun Huang, Dacheng Tao
cs.AI
Abstract
Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to its computational complexity. Existing efficient approaches often rely on complex additional training or on external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models (VLMs) as "optical memory." We construct a training dataset based on OpenR1-Math-220K, achieving a 3.4x token compression ratio, and fine-tune representative VLMs, Glyph and Qwen3-VL. Extensive experiments on benchmarks such as MATH500, AIME25, AMC23, and GPQA-D demonstrate that VTC-R1 consistently outperforms standard long-context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving a 2.7x speedup in end-to-end latency, highlighting its potential as a scalable solution for reasoning-intensive applications. Our code is available at https://github.com/w-yibo/VTC-R1.
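To make the "optical memory" idea concrete, the following is a minimal sketch of how an intermediate reasoning segment could be rendered into a compact image before being appended to a VLM's visual context. It is not the authors' actual rendering pipeline; the use of Pillow, the canvas width, font handling, and wrapping heuristics are all illustrative assumptions.

```python
# Minimal sketch (assumption, not the VTC-R1 implementation): render a textual
# reasoning segment into a compact image so a VLM can consume it as visual
# context instead of raw text tokens.
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_segment(segment: str, width: int = 768, font_size: int = 14,
                   margin: int = 16, line_spacing: int = 4) -> Image.Image:
    """Wrap a reasoning segment and draw it onto a white canvas."""
    font = ImageFont.load_default()   # a real TTF font would pack text more densely
    chars_per_line = 100              # heuristic wrap width; tune for the chosen font
    lines = []
    for paragraph in segment.splitlines():
        lines.extend(textwrap.wrap(paragraph, chars_per_line) or [""])
    line_height = font_size + line_spacing
    height = margin * 2 + line_height * max(len(lines), 1)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line, fill="black", font=font)
    return img

if __name__ == "__main__":
    # Hypothetical intermediate trace: instead of re-feeding it as text, the
    # rendered image would be passed to the VLM on the next reasoning step.
    trace = "Step 1: let x = 3.\nStep 2: substitute into 2x + 1 and obtain 7."
    render_segment(trace).save("optical_memory_0.png")
```

In this sketch, the token savings come from the VLM encoding the image into far fewer visual tokens than the original text would require; the reported 3.4x compression ratio and the iterative feedback loop are properties of the paper's pipeline, not of this example.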