Counting to Four is still a Chore for VLMs
April 11, 2026
Authors: Duy Le Dinh Anh, Patrick Amadeus Irawan, Tuan Van Vo
cs.AI
Abstract
Vision-language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skills such as object counting. Existing evaluations mostly assess only final outputs, offering limited insight into where these failures arise inside the model. In this work, we present an empirical study of VLM counting behavior through both behavioral and mechanistic analysis. We introduce COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases designed to expose vulnerabilities under different patchification layouts and adversarial prompting conditions. Using attention analysis and component-wise probing, we show that count-relevant visual evidence is strongest in the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by this finding, we further evaluate Modality Attention Share (MAS), a lightweight intervention that encourages a minimum budget of visual attention during answer generation. Our results suggest that counting failures in VLMs stem not only from visual perception limits, but also from the underuse of visual evidence during language-stage reasoning. Code and dataset will be released at https://github.com/leduy99/-CVPRW26-Modality-Attention-Share.
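To make the intuition behind a minimum visual-attention budget concrete, the sketch below renormalizes a single attention distribution so that image-patch tokens collectively keep at least a fixed share of the attention mass. This is a hypothetical NumPy illustration: the function name, the `min_share` parameter, and the proportional-rescaling scheme are assumptions for exposition, not the paper's exact MAS formulation.

```python
import numpy as np

def enforce_visual_attention_share(attn, visual_mask, min_share=0.2):
    """Rescale one attention distribution so that the tokens flagged by
    `visual_mask` jointly receive at least `min_share` of the total mass.

    attn        : 1-D probability vector over context tokens (sums to 1).
    visual_mask : boolean array marking image-patch positions.

    Hypothetical sketch only; the paper's MAS intervention may differ.
    """
    attn = np.asarray(attn, dtype=float)
    visual_mask = np.asarray(visual_mask, dtype=bool)

    vis_mass = attn[visual_mask].sum()
    # Nothing to do if the budget is already met, or no visual tokens exist.
    if vis_mass >= min_share or vis_mass == 0.0:
        return attn

    out = attn.copy()
    # Scale visual weights up to exactly the budget...
    out[visual_mask] *= min_share / vis_mass
    # ...and text weights down so the distribution still sums to one.
    txt_mass = attn[~visual_mask].sum()
    out[~visual_mask] *= (1.0 - min_share) / txt_mass
    return out
```

For example, with a distribution `[0.05, 0.05, 0.9]` where the first two entries are image patches, a 20% budget lifts the visual mass from 0.10 to 0.20 while shrinking the text weight proportionally, leaving a valid probability vector.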