ZooClaw-FashionSigLIP2: Robust Fashion Retrieval via Distillation Fine-Tuning

Abstract

Adapting a foundation vision-language encoder to a specialized retrieval task creates a fundamental tradeoff: gains on the target distribution come at the cost of the foundation model's broad generalization. Fashion retrieval is a stringent instance of this problem. We present ZooClaw-FashionSigLIP2, a fashion-specialized SigLIP2-base model that resolves this tradeoff with a simple recipe: full fine-tuning with knowledge distillation on curated in-domain data, followed by Wise-FT (Wortsman et al., 2022) weight interpolation with the base model. This recipe outperforms alternatives that rely on LoRA, larger backbones (up to 1B parameters), or external training data. Under fair evaluation, ZooClaw-FashionSigLIP2 outperforms all baselines on every benchmark in our suite. In addition, we release ZooClaw-Fashion, a new high-quality fashion retrieval benchmark, together with a systematic quality analysis of widely used benchmarks that exposes and mitigates structural biases in their public ground truth. We open-source the model weights and all evaluation artifacts to facilitate future research.
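
The weight-interpolation step of the recipe can be illustrated with a minimal sketch. It assumes the standard Wise-FT formulation, in which each parameter of the merged model is a linear blend of the base and fine-tuned values, theta = (1 - alpha) * theta_base + alpha * theta_finetuned. The function name `interpolate_weights`, the dictionary layout, and the value of `alpha` are illustrative assumptions; the abstract does not state the interpolation coefficient or implementation used for ZooClaw-FashionSigLIP2.

```python
from typing import Any, Dict


def interpolate_weights(
    base: Dict[str, Any],
    finetuned: Dict[str, Any],
    alpha: float,
) -> Dict[str, Any]:
    """Blend two checkpoints parameter by parameter (Wise-FT style).

    alpha = 0.0 returns the base weights, alpha = 1.0 the fine-tuned ones.
    Values may be floats, NumPy arrays, or tensors: anything that supports
    scalar multiplication and addition.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != finetuned.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: (1.0 - alpha) * base[name] + alpha * finetuned[name]
        for name in base
    }


# Toy example with scalar "parameters" (hypothetical values).
base_weights = {"w": 1.0, "b": 0.0}
finetuned_weights = {"w": 3.0, "b": 2.0}
merged = interpolate_weights(base_weights, finetuned_weights, alpha=0.5)
```

With real checkpoints the same function could be applied to two state dictionaries of identical architecture, although non-floating-point buffers would need to be copied instead of blended. In practice `alpha` is typically chosen on held-out data to balance in-domain accuracy against retained general-purpose performance.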