
Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

March 19, 2026
作者: Shang-Jui Ray Kuo, Paola Cascante-Bonilla
cs.AI

Abstract

Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We also observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.
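The frozen-backbone-plus-connector design described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: all dimensions (196 patch tokens, vision width 768, LLM width 4096) and the single linear projection are assumptions chosen for the sketch; real connectors may be MLPs or resamplers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 196 patch tokens from the vision backbone,
# vision feature width 768, LLM embedding width 4096.
num_patches, d_vision, d_llm = 196, 768, 4096

# Output of the frozen vision backbone (stand-in for ViT or SSM features).
image_features = rng.standard_normal((num_patches, d_vision))

# Lightweight connector: here, a single trainable linear projection
# mapping vision features into the LLM's embedding space.
W = rng.standard_normal((d_vision, d_llm)) * 0.02
b = np.zeros(d_llm)
visual_tokens = image_features @ W + b  # shape: (196, 4096)

# Text token embeddings for a short prompt (stand-in for the LLM's
# own embedding table output).
text_tokens = rng.standard_normal((8, d_llm))

# The LLM then consumes the visual tokens prepended to the text tokens.
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (204, 4096)
```

Because only `W` and `b` (the connector) would be trained, swapping the vision backbone — e.g. a ViT for an SSM encoder — only changes how `image_features` is produced, which is what makes the controlled comparison in the paper possible.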
Updated: March 24, 2026