VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning

Abstract

Long-context reasoning has significantly improved the ability of large language models (LLMs) to tackle complex tasks, but its computational cost creates severe efficiency bottlenecks. Existing efficiency methods often rely on complex additional training or on external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models (VLMs) as "optical memory." We construct a training dataset based on OpenR1-Math-220K that achieves 3.4x token compression, and we fine-tune the representative VLMs Glyph and Qwen3-VL. Extensive experiments on benchmarks such as MATH500, AIME25, AMC23, and GPQA-D demonstrate that VTC-R1 consistently outperforms standard long-context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving a 2.7x speedup in end-to-end latency, highlighting its potential as a scalable solution for reasoning-intensive applications. Our code is available at https://github.com/w-yibo/VTC-R1.
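The iterative "optical memory" loop described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the rendering procedure, the segment boundaries, or the stopping rule, so everything here is an assumption made for illustration: `generate_segment` and `render_to_image` are hypothetical callables standing in for the VLM and the text renderer, and `max_rounds` is an invented safety cap. This is not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Placeholder type: an "image" is whatever the (hypothetical) renderer returns.
Image = object


@dataclass
class OpticalMemory:
    """Holds rendered images of earlier reasoning segments."""
    images: List[Image] = field(default_factory=list)


def iterative_reasoning(
    question: str,
    generate_segment: Callable[[str, List[Image]], Tuple[str, bool]],
    render_to_image: Callable[[str], Image],
    max_rounds: int = 8,  # assumed cap, not taken from the paper
) -> Tuple[str, OpticalMemory]:
    """Run segment-by-segment reasoning with image-based memory.

    Each round, the model sees the question plus images of all earlier
    segments (instead of their full text) and returns a new text segment
    together with a flag saying whether it has finished.
    """
    memory = OpticalMemory()
    segment = ""
    for _ in range(max_rounds):
        segment, done = generate_segment(question, memory.images)
        if done:
            break
        # Compress the finished segment into an image and store it
        # so the next round can condition on it.
        memory.images.append(render_to_image(segment))
    return segment, memory
```

In this sketch the text of each earlier segment is dropped once it has been rendered, so the context passed to the model grows by one image per round rather than by the full token length of the segment.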
