
HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models

March 17, 2026
Author: Md Jahidul Islam
cs.AI

Abstract

Adapting large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks often suffers from a "one-size-fits-all" architectural approach, where visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the modalities -- spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: It processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while distinctively processing text tokens via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: Unlike standard expanding adapters, HeBA employs a compression bottleneck (D -> D/4) that explicitly forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: We challenge the restrictive zero-initialization paradigm, utilizing a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without compromising the frozen backbone's pre-trained knowledge. Extensive experiments demonstrate that HeBA's architecturally specialized design achieves superior stability and accuracy, establishing a new state-of-the-art on 11 few-shot benchmarks. Code is available at https://github.com/Jahid12012021/VLM-HeBA.
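The three architectural ideas in the abstract can be illustrated in code. The following is a minimal PyTorch sketch assembled from the abstract's description only, not the authors' released implementation (see the linked repository for that): a depthwise-separable 2D convolutional branch for visual tokens, a dense linear branch for text tokens, a shared compression ratio of D -> D/4, and Kaiming rather than zero initialization. The class name, dimensions, and activation choice are illustrative assumptions.

```python
import torch
import torch.nn as nn


class HeBAAdapterSketch(nn.Module):
    """Hypothetical sketch of the HeBA design described in the abstract:
    modality-specific branches, a compression bottleneck (D -> D/4),
    and Kaiming (non-zero) initialization. Details are assumptions."""

    def __init__(self, dim: int = 512):
        super().__init__()
        hidden = dim // 4  # compression bottleneck D -> D/4
        # Visual branch: depthwise-separable 2D convolution preserves the
        # spatial locality of image tokens arranged on a feature grid.
        self.vis_down = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),  # depthwise
            nn.Conv2d(dim, hidden, kernel_size=1),                      # pointwise
        )
        self.vis_up = nn.Conv2d(hidden, dim, kernel_size=1)
        # Text branch: dense linear projections for semantically dense tokens.
        self.txt_down = nn.Linear(dim, hidden)
        self.txt_up = nn.Linear(hidden, dim)
        self.act = nn.GELU()
        # "Active gradient initialization": Kaiming instead of zero-init,
        # so the adapter contributes gradients from the first step.
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
                if m.bias is not None:
                    nn.init.zeros_(m.bias)

    def forward_visual(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, D, H, W) grid of visual tokens; the residual connection
        # keeps the frozen backbone's features intact.
        return x + self.vis_up(self.act(self.vis_down(x)))

    def forward_text(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) sequence of text tokens.
        return x + self.txt_up(self.act(self.txt_down(x)))
```

Because both branches are residual and compress rather than expand, the trainable parameter count stays small relative to the frozen CLIP backbone, which is consistent with the abstract's framing of the bottleneck as a structural regularizer.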
PDF: March 20, 2026