Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Abstract

Large vision-language models (VLMs) often use a frozen vision backbone whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We also observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.
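To make the setup in the abstract concrete, the following NumPy sketch shows the general pattern of a frozen vision backbone feeding a lightweight connector. It is an illustration under stated assumptions, not the paper's implementation: the sizes, the names `backbone_features` and `connector`, and the two-layer MLP are hypothetical, and a fixed linear patch embedding stands in for a real ViT or SSM backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
PATCH = 16        # patch side length in pixels
VISION_DIM = 64   # width of the vision backbone's features
LLM_DIM = 128     # width of the language model's token embeddings

# Stand-in for a frozen vision backbone (ViT or SSM): a fixed linear
# patch embedding. These weights are never updated.
W_backbone = rng.normal(scale=0.02, size=(PATCH * PATCH * 3, VISION_DIM))

# Lightweight connector: a two-layer MLP, the only part that would be
# trained in this sketch.
W1 = rng.normal(scale=0.02, size=(VISION_DIM, LLM_DIM))
W2 = rng.normal(scale=0.02, size=(LLM_DIM, LLM_DIM))


def backbone_features(image):
    """Split an (H, W, 3) image into patches and embed each with frozen weights."""
    h, w, c = image.shape
    gh, gw = h // PATCH, w // PATCH
    patches = image[: gh * PATCH, : gw * PATCH].reshape(gh, PATCH, gw, PATCH, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, PATCH * PATCH * c)
    return patches @ W_backbone  # (num_patches, VISION_DIM)


def connector(features):
    """Map vision features into the language model's embedding space."""
    hidden = np.maximum(features @ W1, 0.0)  # ReLU
    return hidden @ W2  # (num_patches, LLM_DIM)


image = rng.random((224, 224, 3))
visual_tokens = connector(backbone_features(image))
print(visual_tokens.shape)  # (196, 128)
```

In this pattern, swapping the backbone family only changes `backbone_features`; the connector and the language model see the same kind of token sequence either way, which is what allows a controlled comparison between backbones.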
