BiCLIP: Domain Canonicalization via Structured Geometric Transformation

Abstract

Recent advances in vision-language models (VLMs) have demonstrated remarkable zero-shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered from a small set of anchors. Few-shot classification provides a natural setting for this alignment, because the limited labeled samples serve as the anchors required to estimate the transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross-modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state-of-the-art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at https://github.com/QuantitativeImagingLaboratory/BilinearCLIP.
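To make the idea of a transformation applied between image and text features concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state the form of the transformation; the bilinear score x^T W t used here is an assumption suggested only by the repository name, and every function name (`bilinear_logits`, `orthogonality_gap`, `rotation`, `identity`) is hypothetical. The sketch covers scoring and an orthogonality check only; how W would be fitted from the few-shot anchors is left out because the abstract does not describe it.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length, as is usual for CLIP-style features."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def matvec(W, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def bilinear_logits(image_feats, text_feats, W):
    """Score each image against each class prompt as x^T W t.

    ASSUMPTION: the bilinear form is illustrative; the abstract does not
    specify how the transformation enters the similarity.
    """
    logits = []
    for x in image_feats:
        x = l2_normalize(x)
        row = []
        for t in text_feats:
            Wt = matvec(W, l2_normalize(t))
            row.append(sum(a * b for a, b in zip(x, Wt)))
        logits.append(row)
    return logits


def orthogonality_gap(W):
    """Frobenius norm of W^T W - I; zero when W is exactly orthogonal."""
    d = len(W)
    total = 0.0
    for i in range(d):
        for j in range(d):
            dot = sum(W[k][i] * W[k][j] for k in range(d))
            total += (dot - (1.0 if i == j else 0.0)) ** 2
    return math.sqrt(total)


def identity(d):
    """d x d identity matrix: the 'no adaptation' starting point."""
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


def rotation(theta):
    """2 x 2 rotation matrix, a toy example of an orthogonal transformation."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


if __name__ == "__main__":
    image_feats = [[3.0, 4.0]]              # one toy image embedding
    text_feats = [[1.0, 0.0], [0.0, 1.0]]   # two toy class-prompt embeddings
    for name, W in [("identity", identity(2)), ("rotation", rotation(math.pi / 2))]:
        print(name, bilinear_logits(image_feats, text_feats, W), orthogonality_gap(W))
```

With W set to the identity, the score reduces to plain cosine similarity between image and text features, so any learned W can be read as a correction on top of that baseline. The `orthogonality_gap` value is one simple way to quantify how close a learned W is to a pure rotation or reflection, which is the kind of property the abstract says is analyzed.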
