

BiCLIP: Domain Canonicalization via Structured Geometric Transformation

March 9, 2026
Authors: Pranav Mantini, Shishir K. Shah
cs.AI

Abstract

Recent advances in vision-language models (VLMs) have demonstrated remarkable zero-shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few-shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross-modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state-of-the-art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at https://github.com/QuantitativeImagingLaboratory/BilinearCLIP
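The abstract's core hypothesis, that features across domains are related by a canonicalized geometric transformation recoverable from a few anchor samples, can be illustrated with a classical tool for this kind of estimation. The sketch below is not BiCLIP's actual method (the paper's implementation is in the linked repository); it only demonstrates, under the assumption that the transform is orthogonal, how the orthogonal Procrustes problem recovers it from paired anchors:

```python
import numpy as np

def estimate_canonical_transform(anchors_src, anchors_tgt):
    """Recover an orthogonal transform W mapping source-domain features
    to target-domain features from paired anchors, by solving the
    orthogonal Procrustes problem:
        min_W ||anchors_src @ W - anchors_tgt||_F  s.t.  W^T W = I.
    The solution is the orthogonal polar factor of the cross-covariance."""
    M = anchors_src.T @ anchors_tgt      # cross-covariance of anchor pairs
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

# Toy check: target "domain" features are an exact rotation of the source.
rng = np.random.default_rng(0)
d = 8
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # ground-truth transform
src = rng.standard_normal((20, d))                # 20 labeled anchor samples
tgt = src @ Q                                     # corresponding target features

W = estimate_canonical_transform(src, tgt)
print(np.allclose(W, Q, atol=1e-8))               # transform is recovered
print(np.allclose(W.T @ W, np.eye(d), atol=1e-8)) # and is orthogonal
```

This also mirrors the abstract's empirical analysis: once a transform is learned, checking how close `W.T @ W` is to the identity quantifies its orthogonality.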