HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models

Abstract

Adapting large-scale Vision-Language Models (VLMs) such as CLIP to downstream tasks often suffers from a "one-size-fits-all" architectural approach, in which visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the two modalities -- spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: It processes visual tokens with 2D depthwise-separable convolutions to preserve spatial correlations, and processes text tokens separately with dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: Unlike standard expanding adapters, HeBA employs a compression bottleneck (D -> D/4) that explicitly forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: We challenge the restrictive zero-initialization paradigm by using a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without compromising the frozen backbone's pre-trained knowledge. Extensive experiments demonstrate that HeBA's architecturally specialized design achieves superior stability and accuracy, establishing a new state of the art on 11 few-shot benchmarks. Code is available at https://github.com/Jahid12012021/VLM-HeBA.
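To make the D -> D/4 bottleneck and the two modality-specific branches more concrete, the following is a minimal parameter-counting sketch in plain Python. It is not taken from the released code. The layer layout of each branch, the 3x3 kernel, the example widths (512 and 768, the text and vision widths of CLIP ViT-B/16), and the 4x "expanding" adapter used for comparison are all assumptions made for illustration, and every function name is hypothetical.

```python
# Illustrative parameter counts only. The layer layouts below are assumptions
# for the sake of the example, not the released HeBA implementation.


def linear_params(d_in: int, d_out: int) -> int:
    """Weights plus bias of a dense projection d_in -> d_out."""
    return d_in * d_out + d_out


def depthwise_separable_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    """Depthwise k x k conv (one filter per channel) followed by a 1x1 pointwise conv."""
    depthwise = c_in * kernel * kernel + c_in  # per-channel spatial filters + bias
    pointwise = c_in * c_out + c_out           # 1x1 channel mixing + bias
    return depthwise + pointwise


def mlp_adapter_params(d: int, hidden: int) -> int:
    """Two dense projections: d -> hidden -> d."""
    return linear_params(d, hidden) + linear_params(hidden, d)


def text_branch_params(d: int, reduction: int = 4) -> int:
    """Assumed text branch: dense bottleneck D -> D/reduction -> D."""
    return mlp_adapter_params(d, d // reduction)


def visual_branch_params(d: int, reduction: int = 4, kernel: int = 3) -> int:
    """Assumed visual branch: depthwise-separable conv D -> D/reduction,
    then a 1x1 up-projection back to D."""
    hidden = d // reduction
    return depthwise_separable_params(d, hidden, kernel) + linear_params(hidden, d)


if __name__ == "__main__":
    d_text, d_vision = 512, 768  # example widths (CLIP ViT-B/16), an assumption here
    print("text bottleneck adapter:   ", text_branch_params(d_text))
    print("visual bottleneck adapter: ", visual_branch_params(d_vision))
    # Hypothetical "expanding" adapter with a 4x wider hidden layer, for comparison.
    print("expanding text adapter (4D):", mlp_adapter_params(d_text, 4 * d_text))
```

Under these assumptions, compressing to D/4 instead of expanding to 4D cuts the text branch's parameter count by roughly a factor of sixteen, which illustrates one sense in which the bottleneck constrains capacity. Whether this matches the actual HeBA layer layout should be checked against the linked repository.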
