BiCLIP: Domain Canonicalization via Structured Geometric Transformation
March 9, 2026
Authors: Pranav Mantini, Shishir K. Shah
cs.AI
Abstract
Recent advances in vision-language models (VLMs) have demonstrated remarkable zero-shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few-shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross-modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state-of-the-art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at https://github.com/QuantitativeImagingLaboratory/BilinearCLIP.
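The core hypothesis, that features from two domains are related by a geometric transformation recoverable from a few anchor pairs, can be illustrated with a minimal sketch. The snippet below is not BiCLIP's actual method; it stands in with the classical orthogonal Procrustes solution (an SVD-based least-squares fit over orthogonal maps) and a synthetic "ground-truth" map, and it ends with the kind of orthogonality check the abstract describes.

```python
import numpy as np

# Illustrative sketch only: recover an orthogonal transformation relating two
# feature domains from a few anchor pairs, via orthogonal Procrustes.
# All names and dimensions here are hypothetical, not BiCLIP's implementation.

rng = np.random.default_rng(0)
d, k = 16, 8                      # feature dimension, number of anchor pairs

# Synthetic ground-truth orthogonal map between the domains (demo only)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

X = rng.standard_normal((k, d))   # anchor features in domain A (rows)
Y = X @ Q.T                       # corresponding anchor features in domain B

# Orthogonal Procrustes: W = argmin ||X W^T - Y||_F over orthogonal W,
# solved via the SVD of the cross-covariance of the anchor pairs.
U, _, Vt = np.linalg.svd(Y.T @ X)
W = U @ Vt

# The recovered transformation is orthogonal: W W^T = I
assert np.allclose(W @ W.T, np.eye(d), atol=1e-8)
# And it carries the domain-A anchors onto their domain-B counterparts
assert np.allclose(X @ W.T, Y, atol=1e-6)
```

In the few-shot setting the abstract describes, the labeled support samples would play the role of `X` and `Y`; the paper's learned transformation is then analyzed for exactly this kind of orthogonal structure.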