

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

February 20, 2025
Authors: Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai
cs.AI

Abstract
We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).
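The abstract builds on the original SigLIP's image-text training objective, which is a pairwise sigmoid loss over all image-text pairs in a batch rather than a softmax over the batch. A minimal sketch of that loss is below; the function name is illustrative, the embeddings are assumed L2-normalized, and `t` (temperature) and `b` (bias) are fixed here although they are learnable scalars in the actual model:

```python
import numpy as np

def sigmoid_contrastive_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid image-text loss in the style of SigLIP (a sketch,
    not the authors' implementation). img_emb and txt_emb are (N, D)
    arrays of L2-normalized embeddings; row i of each is a matching pair."""
    logits = t * img_emb @ txt_emb.T + b   # (N, N) pairwise similarity logits
    n = logits.shape[0]
    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2.0 * np.eye(n) - 1.0
    # -log sigmoid(label * logit), averaged over all N^2 pairs
    return -np.mean(np.log(1.0 / (1.0 + np.exp(-labels * logits))))
```

Because every pair contributes an independent binary term, the loss decouples from the batch-level normalization a softmax objective would require; matching pairs are pushed toward high similarity and all other pairs toward low similarity.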

