BiCLIP: Domain Canonicalization via Structured Geometric Transformation

Abstract

Recent advances in vision-language models (VLMs) have demonstrated remarkable zero-shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few-shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross-modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state-of-the-art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at https://github.com/QuantitativeImagingLaboratory/BilinearCLIP
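The hypothesis that a geometric transformation between feature spaces can be recovered from a small set of anchors can be illustrated with a toy example. The sketch below is not BiCLIP's procedure, which the abstract does not specify. It assumes the transformation is orthogonal and uses the classical orthogonal Procrustes solution on synthetic data. The function names, dimensions, and random features are all hypothetical and chosen only for illustration.

```python
import numpy as np


def estimate_orthogonal_map(source_anchors, target_anchors):
    """Orthogonal Procrustes: orthogonal R minimizing ||source @ R - target||_F."""
    # SVD of the cross-covariance between the two anchor sets.
    u, _, vt = np.linalg.svd(source_anchors.T @ target_anchors)
    return u @ vt


def orthogonality_error(matrix):
    """Frobenius distance between matrix^T @ matrix and the identity."""
    return np.linalg.norm(matrix.T @ matrix - np.eye(matrix.shape[1]))


rng = np.random.default_rng(0)
dim, num_anchors = 8, 16  # hypothetical sizes, not taken from the paper

# Synthetic stand-in for a domain shift: a random orthogonal matrix
# obtained from the QR decomposition of a Gaussian matrix.
true_map, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

# Synthetic anchor features in the source domain and their images
# in the target domain (noise-free for simplicity).
source = rng.normal(size=(num_anchors, dim))
target = source @ true_map

recovered = estimate_orthogonal_map(source, target)

print("recovery error:", np.linalg.norm(recovered - true_map))
print("orthogonality error:", orthogonality_error(recovered))
```

In this noise-free setting with more anchors than feature dimensions, the recovered map matches the synthetic one up to floating-point error. The `orthogonality_error` helper shows one simple way to measure how far a matrix is from orthogonal, in the spirit of the orthogonality analysis the abstract mentions.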
