PaLI-3 Vision Language Models: Smaller, Faster, Stronger
October 13, 2023
Authors: Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut
cs.AI
Abstract
This paper presents PaLI-3, a smaller, faster, and stronger vision language
model (VLM) that compares favorably to similar models that are 10x larger. As
part of arriving at this strong performance, we compare Vision Transformer
(ViT) models pretrained using classification objectives to contrastively
(SigLIP) pretrained ones. We find that, while slightly underperforming on
standard image classification benchmarks, SigLIP-based PaLI shows superior
performance across various multimodal benchmarks, especially on localization
and visually-situated text understanding. We scale the SigLIP image encoder up
to 2 billion parameters, and achieve a new state-of-the-art on multilingual
cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles
research on fundamental pieces of complex VLMs, and could fuel a new generation
of scaled-up models.
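
As background for the comparison above between classification-pretrained ViT encoders and contrastively (SigLIP) pretrained ones, the following is a minimal JAX sketch of the pairwise sigmoid loss that SigLIP-style pretraining optimizes over a batch of image-text embedding pairs. It is an illustration only, assuming L2-normalized embeddings and learnable temperature/bias scalars; the function name and shapes are our assumptions, not code from the paper.

import jax
import jax.numpy as jnp

def sigmoid_contrastive_loss(img_emb, txt_emb, temperature, bias):
    # img_emb, txt_emb: [batch, dim], assumed L2-normalized (illustrative shapes).
    logits = temperature * img_emb @ txt_emb.T + bias  # [batch, batch] pair scores
    # Target is +1 on the diagonal (matching image-text pairs), -1 elsewhere.
    labels = 2.0 * jnp.eye(logits.shape[0]) - 1.0
    # Independent sigmoid cross-entropy over every pair, averaged per example.
    return -jnp.mean(jnp.sum(jax.nn.log_sigmoid(labels * logits), axis=-1))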