Counting to Four is still a Chore for VLMs
April 11, 2026
Authors: Duy Le Dinh Anh, Patrick Amadeus Irawan, Tuan Van Vo
cs.AI
Abstract
Vision-language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skills such as object counting. Existing evaluations mostly assess only final outputs, offering limited insight into where these failures arise inside the model. In this work, we present an empirical study of VLM counting behavior through both behavioral and mechanistic analysis. We introduce COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases designed to expose vulnerabilities under different patchification layouts and adversarial prompting conditions. Using attention analysis and component-wise probing, we show that count-relevant visual evidence is strongest in the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by this finding, we further evaluate Modality Attention Share (MAS), a lightweight intervention that encourages a minimum budget of visual attention during answer generation. Our results suggest that counting failures in VLMs stem not only from visual perception limits, but also from the underuse of visual evidence during language-stage reasoning. Code and dataset will be released at https://github.com/leduy99/-CVPRW26-Modality-Attention-Share.
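The abstract does not spell out how MAS enforces its "minimum budget of visual attention," so the following is only a minimal sketch of one plausible realization: rescaling a single attention distribution so that image-patch (visual) tokens jointly keep at least a fixed share of the attention mass, while preserving relative weights within each modality. The function name, the `min_share` parameter, and the NumPy formulation are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def enforce_visual_attention_budget(attn, visual_mask, min_share=0.2):
    """Rescale one attention distribution (hypothetical MAS-style step).

    attn        : 1-D array of attention weights for one query, summing to 1.
    visual_mask : boolean array, True where the key token is a visual token.
    min_share   : minimum total attention mass the visual tokens must keep.

    If visual tokens already meet the budget (or there are none), the
    distribution is returned unchanged; otherwise visual weights are scaled
    up to exactly min_share and text weights down to the remainder.
    """
    attn = np.asarray(attn, dtype=float)
    visual_mask = np.asarray(visual_mask, dtype=bool)

    vis_mass = attn[visual_mask].sum()
    if vis_mass >= min_share or vis_mass == 0.0:
        return attn  # budget already satisfied, or nothing to boost

    out = attn.copy()
    # Boost visual tokens proportionally so their total is min_share.
    out[visual_mask] *= min_share / vis_mass
    # Shrink text tokens proportionally so the distribution still sums to 1.
    txt_mass = attn[~visual_mask].sum()
    out[~visual_mask] *= (1.0 - min_share) / txt_mass
    return out
```

In a real decoder this would have to be applied per head and per generation step, and the paper's actual intervention may differ in where and how the budget is imposed; the sketch only illustrates the reallocation idea.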