VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning

Abstract

Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to its computational complexity. Existing efficient approaches often rely on complex additional training or external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models (VLMs) as "optical memory." We construct a training dataset based on OpenR1-Math-220K that achieves 3.4x token compression, and fine-tune two representative VLMs, Glyph and Qwen3-VL. Extensive experiments on benchmarks such as MATH500, AIME25, AMC23, and GPQA-D demonstrate that VTC-R1 consistently outperforms standard long-context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving a 2.7x speedup in end-to-end latency, highlighting its potential as a scalable solution for reasoning-intensive applications. Our code is available at https://github.com/w-yibo/VTC-R1.
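The iterative "optical memory" loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `RenderedPage`, `render_segment`, `reason_with_optical_memory`, and `toy_step` are hypothetical names, the renderer only simulates rasterization by assigning a vision-token cost, and the 3.4 ratio is borrowed from the reported dataset-level compression purely as an assumed constant.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class RenderedPage:
    """Hypothetical stand-in for an image of one reasoning segment."""
    source_text: str
    vision_tokens: int


def render_segment(text: str, ratio: float = 3.4) -> RenderedPage:
    # A real system would rasterize the text into an image; here the cost is
    # only approximated from a whitespace token count and an assumed ratio.
    text_tokens = len(text.split())
    return RenderedPage(text, max(1, round(text_tokens / ratio)))


# A step function plays the role of one VLM call (assumption): it receives the
# question plus the rendered memory and returns (segment_text, is_final).
StepFn = Callable[[str, List[RenderedPage]], Tuple[str, bool]]


def reason_with_optical_memory(
    question: str, step_fn: StepFn, max_rounds: int = 4
) -> Tuple[Optional[str], List[RenderedPage]]:
    memory: List[RenderedPage] = []
    for _ in range(max_rounds):
        segment, done = step_fn(question, memory)
        if done:
            return segment, memory
        # Earlier reasoning is kept only in rendered (compressed) form.
        memory.append(render_segment(segment))
    return None, memory


def toy_step(question: str, memory: List[RenderedPage]) -> Tuple[str, bool]:
    # Toy model: emits two intermediate segments, then a final answer.
    if len(memory) < 2:
        return f"step {len(memory) + 1}: " + "intermediate work " * 10, False
    return "final answer: 42", True


answer, memory = reason_with_optical_memory("toy question", toy_step)
```

In this sketch each round sees the question and the rendered pages rather than the full text of earlier segments, which is where any saving in context length would come from.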
