
HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models

March 17, 2026
Author: Md Jahidul Islam
cs.AI

Abstract

Adapting large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks often suffers from a "one-size-fits-all" architectural approach, where visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the modalities -- spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: It processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while distinctively processing text tokens via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: Unlike standard expanding adapters, HeBA employs a compression bottleneck (D -> D/4) that explicitly forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: We challenge the restrictive zero-initialization paradigm, utilizing a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without compromising the frozen backbone's pre-trained knowledge. Extensive experiments demonstrate that HeBA's architecturally specialized design achieves superior stability and accuracy, establishing a new state-of-the-art on 11 few-shot benchmarks. Code is available at https://github.com/Jahid12012021/VLM-HeBA.
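The three architectural ideas in the abstract can be illustrated in code. The following is a minimal PyTorch sketch assembled from the abstract's description only, not the authors' released implementation (see the linked repository for that): a depthwise-separable 2D convolutional branch for visual tokens, a dense linear branch for text tokens, a shared compression ratio of D -> D/4, and Kaiming rather than zero initialization. The class name, dimensions, and activation choice are illustrative assumptions.

```python
import torch
import torch.nn as nn


class HeBAAdapterSketch(nn.Module):
    """Hypothetical sketch of the HeBA design described in the abstract:
    modality-specific branches, a compression bottleneck (D -> D/4),
    and Kaiming (non-zero) initialization. Details are assumptions."""

    def __init__(self, dim: int = 512):
        super().__init__()
        hidden = dim // 4  # compression bottleneck D -> D/4
        # Visual branch: depthwise-separable 2D convolution preserves the
        # spatial locality of image tokens arranged on a feature grid.
        self.vis_down = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),  # depthwise
            nn.Conv2d(dim, hidden, kernel_size=1),                      # pointwise
        )
        self.vis_up = nn.Conv2d(hidden, dim, kernel_size=1)
        # Text branch: dense linear projections for semantically dense tokens.
        self.txt_down = nn.Linear(dim, hidden)
        self.txt_up = nn.Linear(hidden, dim)
        self.act = nn.GELU()
        # "Active gradient initialization": Kaiming instead of zero-init,
        # so the adapter contributes gradients from the first step.
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
                if m.bias is not None:
                    nn.init.zeros_(m.bias)

    def forward_visual(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, D, H, W) grid of visual tokens; the residual connection
        # keeps the frozen backbone's features intact.
        return x + self.vis_up(self.act(self.vis_down(x)))

    def forward_text(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) sequence of text tokens.
        return x + self.txt_up(self.act(self.txt_down(x)))
```

Because both branches are residual and compress rather than expand, the trainable parameter count stays small relative to the frozen CLIP backbone, which is consistent with the abstract's framing of the bottleneck as a structural regularizer.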
PDF: March 20, 2026