Glyph: Scaling Context Windows via Visual-Text Compression

Abstract

Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level incurs prohibitive computational and memory costs, which limits the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information. To balance accuracy against compression, we further design an LLM-driven genetic search that identifies optimal visual rendering configurations. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also yields around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
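The relationship between compression ratio and effective context length can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the Glyph codebase: the function names are hypothetical, "K" is assumed to mean 1024 tokens, and the 8x figure is simply what the abstract's 128K-to-1M claim implies under that assumption, not a ratio stated by the authors.

```python
# Illustrative arithmetic only; names and the K = 1024 convention are assumptions.

def compression_ratio(text_tokens: int, visual_tokens: int) -> float:
    """Number of text tokens represented by each visual token."""
    if visual_tokens <= 0:
        raise ValueError("visual_tokens must be positive")
    return text_tokens / visual_tokens


def effective_text_context(visual_context: int, ratio: float) -> int:
    """Text tokens that fit in a visual context window at a given ratio."""
    return int(visual_context * ratio)


def required_ratio(target_text_tokens: int, visual_context: int) -> float:
    """Compression ratio needed to fit a target text length in the window."""
    return target_text_tokens / visual_context


K = 1024
window = 128 * K  # the 128K-context VLM mentioned in the abstract

# At the reported 3-4x compression, the window covers roughly 384K-512K text tokens.
for ratio in (3.0, 4.0):
    print(f"{ratio}x -> {effective_text_context(window, ratio)} text tokens")

# Reaching 1M text tokens implies a ratio of about 8x under these assumptions.
print(f"ratio needed for 1M tokens: {required_ratio(1024 * K, window)}")
```

Read this way, the "extreme compression" setting corresponds to roughly twice the ratio reported for the standard benchmarks, which is consistent with the abstract presenting it as a separate regime.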
