Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders
March 19, 2026
作者: Shang-Jui Ray Kuo, Paola Cascante-Bonilla
cs.AI
Abstract
Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We also observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.
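As a rough illustration of the architecture the abstract describes (a frozen vision backbone whose features are projected into the LLM through a lightweight connector), the following is a minimal sketch, not the authors' implementation; the class name, MLP connector design, and dimensions are illustrative assumptions.

```python
# Minimal sketch of a VLM with a frozen vision backbone and a lightweight
# connector. The backbone (SSM- or ViT-based) is frozen; only the connector
# (and possibly the LLM) would be trained. Dimensions are placeholder values.
import torch
import torch.nn as nn


class FrozenBackboneVLM(nn.Module):
    def __init__(self, vision_backbone: nn.Module, language_model: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_backbone = vision_backbone
        # Freeze the vision encoder, as described in the abstract.
        for p in self.vision_backbone.parameters():
            p.requires_grad = False
        # Lightweight connector: a two-layer MLP projecting patch features
        # into the LLM's token-embedding space.
        self.connector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.language_model = language_model

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        # Extract visual features without gradients (backbone is frozen).
        with torch.no_grad():
            vis_feats = self.vision_backbone(images)   # (B, N_patches, vision_dim)
        vis_tokens = self.connector(vis_feats)          # (B, N_patches, llm_dim)
        # Prepend projected visual tokens to the text embeddings; here we
        # assume the language model accepts an embedding sequence directly.
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)
        return self.language_model(inputs)
```

In this sketch, swapping an SSM backbone for a ViT only changes the `vision_backbone` module; the connector and language model are unchanged, which is what makes the controlled comparison in the paper possible.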