HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models

Abstract

Adapting large-scale Vision-Language Models (VLMs) such as CLIP to downstream tasks often suffers from a "one-size-fits-all" architectural approach, in which visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the two modalities: spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: it processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while processing text tokens separately via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: unlike standard expanding adapters, HeBA employs a compression bottleneck (D -> D/4) that explicitly forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: we challenge the restrictive zero-initialization paradigm, using a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without compromising the frozen backbone's pre-trained knowledge. Extensive experiments demonstrate that HeBA's architecturally specialized design achieves superior stability and accuracy, establishing a new state-of-the-art on 11 few-shot benchmarks. Code is available at https://github.com/Jahid12012021/VLM-HeBA.
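
To make the three design choices concrete, the following is a minimal back-of-the-envelope sketch in plain Python, not the authors' implementation. It counts parameters for a D -> D/4 bottleneck against an expanding adapter, compares a depthwise-separable convolution with a standard one, and computes the standard deviation used by Kaiming (He) normal initialization. The feature width D = 512, the 3x3 kernel, the 4x expansion factor for the "expanding" baseline, the ReLU gain, and all function names are assumptions chosen for illustration; none of them are taken from the paper.

```python
import math


def linear_params(d_in, d_out, bias=True):
    """Parameter count of one dense layer (weights plus optional biases)."""
    return d_in * d_out + (d_out if bias else 0)


def two_layer_adapter_params(d, hidden):
    """Dense adapter D -> hidden -> D, as might be applied to text tokens."""
    return linear_params(d, hidden) + linear_params(hidden, d)


def standard_conv_params(c_in, c_out, kernel=3, bias=True):
    """Parameter count of an ordinary k x k 2D convolution."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


def depthwise_separable_params(c_in, c_out, kernel=3, bias=True):
    """Depthwise k x k conv (one filter per channel) plus a 1x1 pointwise conv."""
    depthwise = c_in * kernel * kernel + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


def kaiming_normal_std(fan_in, gain=math.sqrt(2.0)):
    """Std of Kaiming normal init: gain / sqrt(fan_in); gain sqrt(2) assumes ReLU."""
    return gain / math.sqrt(fan_in)


if __name__ == "__main__":
    D = 512  # assumed feature width, for illustration only

    bottleneck = two_layer_adapter_params(D, D // 4)  # compression: D -> D/4 -> D
    expanding = two_layer_adapter_params(D, D * 4)    # assumed baseline: D -> 4D -> D
    print("bottleneck adapter params:", bottleneck)
    print("expanding adapter params: ", expanding)

    print("standard 3x3 conv params:       ", standard_conv_params(D, D // 4))
    print("depthwise-separable conv params:", depthwise_separable_params(D, D // 4))

    # A non-zero std means the adapter weights start away from zero,
    # so gradients can flow from the first training step.
    print("Kaiming normal std for fan_in = D:", kaiming_normal_std(D))
```

Under these assumed sizes, the compression bottleneck uses roughly 16 times fewer parameters than the expanding variant, and the depthwise-separable convolution roughly 8 times fewer than a standard 3x3 convolution with the same input and output channels. This is one way to read the abstract's claim that the bottleneck acts as a structural regularizer.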
