VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
December 17, 2025
Authors: Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni, Fanhu Zeng, Gaofeng Meng, Zhaoxiang Zhang
cs.AI
Abstract
The computational and memory overheads associated with expanding the context window of large language models (LLMs) severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks such as DeepSeek-OCR and Glyph, which converts long texts into dense 2D visual representations, achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates a model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations in order to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering over long-term dialogue memory. Furthermore, we construct VTCBench-Wild to simulate diverse input scenarios. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite decoding textual information well (e.g., OCR), most VLMs exhibit surprisingly poor long-context understanding over VTC-compressed information, failing to capture long-range associations and dependencies in the context. This study provides a deeper understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
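To make the compression arithmetic concrete, the following is a minimal sketch of the VTC idea: long text is rendered onto a dense 2D canvas, and vision tokens are counted as image patches. Everything here is an illustrative assumption, not the actual DeepSeek-OCR or Glyph pipeline: the `render_text_to_image` helper, the 896-pixel canvas, the 28x28 patch size, and the per-word token estimate are all stand-ins.

```python
# A minimal sketch of vision-text compression (VTC), assuming a
# hypothetical VLM whose vision encoder emits one token per 28x28
# pixel patch. Renderer, canvas width, and token estimates are
# illustrative, not the DeepSeek-OCR or Glyph pipelines.
from PIL import Image, ImageDraw, ImageFont


def render_text_to_image(text: str, width: int = 896,
                         char_w: int = 7, line_h: int = 12) -> Image.Image:
    """Render long text onto one dense 2D canvas, naively line-wrapped."""
    font = ImageFont.load_default()  # small fixed-size bitmap font
    chars_per_line = width // char_w
    lines = [text[i:i + chars_per_line]
             for i in range(0, len(text), chars_per_line)]
    img = Image.new("RGB", (width, max(line_h, len(lines) * line_h)), "white")
    draw = ImageDraw.Draw(img)
    for row, line in enumerate(lines):
        draw.text((0, row * line_h), line, fill="black", font=font)
    return img


def compression_ratio(text: str, patch: int = 28) -> float:
    """Estimated text tokens divided by vision tokens (image patches)."""
    text_tokens = int(len(text.split()) * 1.3)  # crude BPE estimate
    img = render_text_to_image(text)
    vision_tokens = max((img.width // patch) * (img.height // patch), 1)
    return text_tokens / vision_tokens


if __name__ == "__main__":
    doc = "needle in a long haystack " * 1000  # ~5,000-word stand-in document
    print(f"compression ratio: {compression_ratio(doc):.1f}x")
```

With these toy constants the ratio lands well below the 3x-20x the paper cites; real VTC frameworks use much denser rendering and stronger vision encoders. The point is only the shape of the computation (text tokens in, patch tokens out), which is what the benchmark then probes for long-range understanding.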