SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Abstract

We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques, combined into a unified recipe: captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants that support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost against performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).

