Counting to Four is still a Chore for VLMs

Abstract

Vision-language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail at simple grounding skills such as object counting. Existing evaluations mostly assess only final outputs, offering limited insight into where inside the model these failures arise. In this work, we present an empirical study of VLM counting behavior through both behavioral and mechanistic analysis. We introduce COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases designed to expose vulnerabilities under different patchification layouts and adversarial prompting conditions. Using attention analysis and component-wise probing, we show that count-relevant visual evidence is strongest at the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by this finding, we further evaluate Modality Attention Share (MAS), a lightweight intervention that encourages a minimum budget of visual attention during answer generation. Our results suggest that counting failures in VLMs stem not only from limits of visual perception but also from the underuse of visual evidence during language-stage reasoning. Code and dataset will be released at https://github.com/leduy99/-CVPRW26-Modality-Attention-Share.

