Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models

Abstract

Vision-Language Models (VLMs) excel on many multimodal reasoning benchmarks, but these evaluations often do not require an exhaustive readout of the image and can therefore obscure failures to capture every visual detail faithfully. We introduce Grid2Matrix (G2M), a controlled benchmark in which a model is shown a color grid and a color-to-number mapping, and must output the corresponding matrix. By varying grid size and the number of colors, G2M provides a simple way to increase visual complexity while minimizing semantic confounds. We find that VLMs exhibit a sharp early collapse in zero-shot end-to-end evaluation, failing on surprisingly small grids rather than degrading gradually as the task becomes denser. We probe the visual encoders of VLMs from two representative families and find that they preserve substantially more of the grid information than the corresponding end-to-end outputs. This suggests that the failure is not explained by visual encoding alone, but also reflects a gap between what remains recoverable from visual features and what is ultimately expressed in language. We term this gap Digital Agnosia. Further analyses show that these errors are highly structured and depend strongly on how grid cells overlap with visual patch boundaries. We also find that common strategies such as model scaling and multimodal alignment do not fully eliminate this failure mode. We expect G2M to serve as a useful testbed for understanding where and how VLMs lose fine visual details, and for evaluating tasks where missing even small visual details can matter, such as tables, charts, forms, and GUIs.
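As a rough illustration of the task described above, the following minimal sketch builds one G2M-style sample (a color grid, a color-to-number mapping, and the target matrix), rasterizes it, scores a predicted matrix cell by cell, and flags grid cells that span more than one visual patch. Everything here is an assumption made for illustration: the palette, the function names, the pixel sizes, and the per-cell accuracy metric are hypothetical and are not taken from the benchmark itself.

```python
import random

# Hypothetical palette; the benchmark's actual colors are not specified here.
PALETTE = {
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
}


def make_instance(rows, cols, num_colors, seed=0):
    """Return (color_grid, color_to_number, target_matrix) for one sample."""
    rng = random.Random(seed)
    names = list(PALETTE)[:num_colors]
    # Assumed mapping: each color name is assigned its index in the palette.
    color_to_number = {name: i for i, name in enumerate(names)}
    color_grid = [[rng.choice(names) for _ in range(cols)] for _ in range(rows)]
    target = [[color_to_number[c] for c in row] for row in color_grid]
    return color_grid, color_to_number, target


def render(color_grid, cell_px):
    """Rasterize the grid into rows of RGB tuples, cell_px pixels per cell side."""
    image = []
    for row in color_grid:
        pixel_row = [PALETTE[c] for c in row for _ in range(cell_px)]
        image.extend(list(pixel_row) for _ in range(cell_px))
    return image


def cell_accuracy(predicted, target):
    """Fraction of target cells that the predicted matrix reproduces exactly."""
    total = sum(len(row) for row in target)
    correct = sum(
        1
        for i, row in enumerate(target)
        for j, value in enumerate(row)
        if i < len(predicted) and j < len(predicted[i]) and predicted[i][j] == value
    )
    return correct / total


def straddles_patch_boundary(i, j, cell_px, patch_px):
    """True if cell (i, j) covers pixels from more than one patch_px-wide patch."""
    def spans(index):
        start = index * cell_px
        end = start + cell_px - 1
        return start // patch_px != end // patch_px

    return spans(i) or spans(j)
```

Varying `rows`, `cols`, and `num_colors` corresponds to the two complexity knobs named in the abstract, and `straddles_patch_boundary` gives one simple way to group cells by their overlap with an assumed square patch grid when inspecting errors.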