BiCLIP: Domain Canonicalization via Structured Geometric Transformation

Abstract

Recent advances in vision-language models (VLMs) have demonstrated remarkable zero-shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few-shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross-modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state-of-the-art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at https://github.com/QuantitativeImagingLaboratory/BilinearCLIP
