VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning

Abstract

Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to its computational complexity. Existing efficient approaches often rely on complex additional training or on external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models (VLMs) as "optical memory." We construct a training dataset based on OpenR1-Math-220K that achieves 3.4x token compression, and we fine-tune two representative VLMs, Glyph and Qwen3-VL. Extensive experiments on benchmarks such as MATH500, AIME25, AMC23, and GPQA-D demonstrate that VTC-R1 consistently outperforms standard long-context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving a 2.7x speedup in end-to-end latency, which highlights its potential as a scalable solution for reasoning-intensive applications. Our code is available at https://github.com/w-yibo/VTC-R1.
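The iterative "optical memory" loop described in the abstract can be illustrated with a minimal control-flow sketch. This is not the released VTC-R1 implementation: the names `iterative_reasoning`, `OpticalMemory`, `toy_model`, and `toy_render` are hypothetical, the renderer is a stand-in that returns bytes rather than a real image, and the stopping rule is an assumption made only so the example runs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class OpticalMemory:
    """Images of earlier reasoning segments (opaque bytes in this sketch)."""
    pages: List[bytes] = field(default_factory=list)


# Hypothetical interfaces, assumed for illustration only.
Renderer = Callable[[str], bytes]                       # text segment -> image bytes
Model = Callable[[str, List[bytes]], Tuple[str, bool]]  # (question, images) -> (segment, done)


def iterative_reasoning(question: str, model: Model, render: Renderer,
                        max_rounds: int = 8) -> Tuple[str, OpticalMemory]:
    """Run segment-by-segment reasoning, feeding past segments back as images."""
    memory = OpticalMemory()
    segment = ""
    for _ in range(max_rounds):
        # The model sees the question plus rendered images, not the raw text trace.
        segment, done = model(question, memory.pages)
        if done:
            break
        # Compress the finished segment into an image and store it.
        memory.pages.append(render(segment))
    return segment, memory


def toy_render(text: str) -> bytes:
    """Stand-in for a text-to-image rasterizer (assumption: returns raw bytes)."""
    return text.encode("utf-8")


def toy_model(question: str, images: List[bytes]) -> Tuple[str, bool]:
    """Stand-in for a VLM: emits two intermediate segments, then an answer."""
    if len(images) < 2:
        return f"step {len(images) + 1} for: {question}", False
    return "final answer", True


answer, memory = iterative_reasoning("toy question", toy_model, toy_render)
```

In a real system, `toy_render` would be replaced by an actual rasterizer and `toy_model` by a fine-tuned VLM. The loop shows only the point made in the abstract: each round conditions on compact images of earlier segments instead of a growing text trace.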
