Glyph: Scaling Context Windows via Visual-Text Compression
October 20, 2025
Authors: Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang
cs.AI
Abstract
Large language models (LLMs) increasingly rely on long-context modeling for
tasks such as document understanding, code analysis, and multi-step reasoning.
However, scaling context windows to the million-token level brings prohibitive
computational and memory costs, limiting the practicality of long-context LLMs.
In this work, we take a different perspective, visual context scaling, to tackle
this challenge. Instead of extending token-based sequences, we propose Glyph, a
framework that renders long texts into images and processes them with
vision-language models (VLMs). This approach substantially compresses textual
input while preserving semantic information, and we further design an
LLM-driven genetic search to identify optimal visual rendering configurations
for balancing accuracy and compression. Through extensive experiments, we
demonstrate that our method achieves 3-4x token compression while maintaining
accuracy comparable to leading LLMs such as Qwen3-8B on various long-context
benchmarks. This compression also leads to around 4x faster prefilling and
decoding, and approximately 2x faster SFT training. Furthermore, under extreme
compression, a 128K-context VLM could scale to handle 1M-token-level text
tasks. In addition, the rendered text data benefits real-world multimodal
tasks, such as document understanding. Our code and model are released at
https://github.com/thu-coai/Glyph.
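
To make the core idea concrete, below is a minimal sketch of how long text could be rendered into fixed-size image pages for a VLM to consume. It assumes the Pillow library; the page size, font, and wrapping width are illustrative placeholders, not the rendering configurations that Glyph's LLM-driven genetic search would actually select.

```python
# Minimal sketch of visual-text compression, assuming the Pillow library.
# The rendering parameters below (page size, font, line width) are illustrative
# only; Glyph searches over such configurations with an LLM-driven genetic search.
from PIL import Image, ImageDraw, ImageFont
import textwrap


def render_text_to_pages(text: str, page_size=(1024, 1024), font_size=18,
                         margin=32, chars_per_line=90):
    """Render long text onto fixed-size image pages for a VLM to consume."""
    font = ImageFont.load_default()  # swap in a TTF font for denser layouts
    lines = textwrap.wrap(text, width=chars_per_line)  # note: drops paragraph breaks
    line_height = font_size + 4
    lines_per_page = (page_size[1] - 2 * margin) // line_height

    pages = []
    for start in range(0, len(lines), lines_per_page):
        page = Image.new("RGB", page_size, "white")
        draw = ImageDraw.Draw(page)
        y = margin
        for line in lines[start:start + lines_per_page]:
            draw.text((margin, y), line, fill="black", font=font)
            y += line_height
        pages.append(page)
    return pages

# Hypothetical usage: each page becomes one image input to a VLM. If a page
# holds several thousand text tokens but costs roughly a quarter as many vision
# tokens, the effective compression is in the 3-4x range reported in the abstract
# (the exact ratio depends on the rendering configuration and the VLM's tokenizer).
```

The sketch only covers the rendering step; the actual accuracy-compression trade-off in Glyph is governed by the searched configuration (font, layout, resolution) and the VLM's visual tokenization.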