HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models

Abstract

Adapting large-scale Vision-Language Models (VLMs) such as CLIP to downstream tasks often suffers from a "one-size-fits-all" architectural approach, in which visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the two modalities: spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: it processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while processing text tokens separately via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: unlike standard expanding adapters, HeBA employs a compression bottleneck (D -> D/4) that explicitly forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: we challenge the restrictive zero-initialization paradigm by using a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without compromising the frozen backbone's pre-trained knowledge. Extensive experiments demonstrate that HeBA's architecturally specialized design achieves superior stability and accuracy, establishing a new state of the art on 11 few-shot benchmarks. Code is available at https://github.com/Jahid12012021/VLM-HeBA.
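To give a rough sense of scale for the bottleneck and initialization ideas above, the following sketch counts parameters and computes a Kaiming standard deviation. It is illustrative only and not taken from the paper or its repository: the two-layer down/up layout, the 4x expansion used for comparison, the 3x3 kernel, the width D = 512, and every function name are assumptions.

```python
import math

def linear_params(d_in, d_out, bias=True):
    """Weights (plus optional bias) of one dense layer."""
    return d_in * d_out + (d_out if bias else 0)

def two_layer_adapter_params(d, hidden):
    """Assumed layout: projection d -> hidden, then hidden -> d."""
    return linear_params(d, hidden) + linear_params(hidden, d)

def depthwise_separable_params(channels, kernel=3, bias=True):
    """Depthwise k x k conv (one filter per channel) plus a 1x1 pointwise conv."""
    depthwise = channels * kernel * kernel + (channels if bias else 0)
    pointwise = channels * channels + (channels if bias else 0)
    return depthwise + pointwise

def standard_conv_params(channels, kernel=3, bias=True):
    """Full k x k convolution mixing all channels at every spatial offset."""
    return channels * channels * kernel * kernel + (channels if bias else 0)

def kaiming_normal_std(fan_in, gain=math.sqrt(2.0)):
    """Kaiming (He) normal std in fan-in mode: gain / sqrt(fan_in)."""
    return gain / math.sqrt(fan_in)

D = 512  # assumed token width, chosen only for illustration

# Compression bottleneck (D -> D/4) versus an assumed expanding adapter (D -> 4D).
bottleneck = two_layer_adapter_params(D, D // 4)
expanding = two_layer_adapter_params(D, 4 * D)

# Convolution cost inside the bottleneck, at D/4 channels.
separable = depthwise_separable_params(D // 4)
standard = standard_conv_params(D // 4)

# Zero-initialization has std 0; Kaiming gives a non-zero starting scale.
std = kaiming_normal_std(D)

print(f"bottleneck adapter params: {bottleneck}")
print(f"expanding adapter params:  {expanding}")
print(f"depthwise-separable conv:  {separable}")
print(f"standard conv:             {standard}")
print(f"Kaiming normal std:        {std}")
```

Under these assumptions the compressed adapter uses roughly one sixteenth of the parameters of the expanding one, and the depthwise-separable convolution is several times smaller than a full convolution at the same channel count. The actual HeBA layer layout may differ.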
