VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?
February 4, 2026
Authors: Qing'an Liu, Juntong Feng, Yuhao Wang, Xinzhe Han, Yujie Cheng, Yue Zhu, Haiwen Diao, Yunzhi Zhuge, Huchuan Lu
cs.AI
Abstract
Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs, yet existing benchmarks predominantly focus on pure-text queries. In real-world scenarios, language also frequently appears as visualized text embedded in images, raising the question of whether current VLMs handle such inputs comparably. We introduce VISTA-Bench, a systematic benchmark spanning multimodal perception, multimodal reasoning, and unimodal understanding. It evaluates visualized-text understanding by contrasting pure-text and visualized-text questions under controlled rendering conditions. Extensive evaluation of over 20 representative VLMs reveals a pronounced modality gap: models that perform well on pure-text queries often degrade substantially when equivalent semantic content is presented as visualized text. The gap widens further as perceptual difficulty increases, highlighting sensitivity to rendering variations despite unchanged semantics. Overall, VISTA-Bench provides a principled evaluation framework to diagnose this limitation and to guide progress toward more unified language representations across tokenized text and pixels. The source dataset is available at https://github.com/QingAnLiu/VISTA-Bench.
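
The central protocol described above, posing a question once as plain text and once rendered into an image, can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration (Pillow for rendering, a hypothetical `query_vlm` call, arbitrary font and canvas settings), not the benchmark's actual construction pipeline.

```python
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_text_to_image(question: str, width: int = 768) -> Image.Image:
    """Render a question string onto a white canvas, simulating 'visualized text'."""
    font = ImageFont.load_default()            # a controlled setup would pin a specific TTF font
    lines = textwrap.wrap(question, width=60)  # naive fixed-width line wrapping
    line_height = 24
    img = Image.new("RGB", (width, line_height * len(lines) + 40), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((20, 20 + i * line_height), line, fill="black", font=font)
    return img

# Build the two conditions for a single item; query_vlm() stands in for any
# VLM inference call and is purely hypothetical.
question = ("Which planet in the Solar System has the largest mass? "
            "A) Earth  B) Jupiter  C) Mars  D) Venus")
pure_text_input  = {"text": question, "image": None}
visualized_input = {"text": "Answer the question shown in the image.",
                    "image": render_text_to_image(question)}
# modality_gap = accuracy(query_vlm(pure_text_input)) - accuracy(query_vlm(visualized_input))
```

In this framing, the only variable is the rendering condition, so any accuracy drop on the second input reflects the modality gap the benchmark is designed to measure.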