PaLI-3 Vision Language Models: Smaller, Faster, Stronger
October 13, 2023
Authors: Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut
cs.AI
Abstract
This paper presents PaLI-3, a smaller, faster, and stronger vision language
model (VLM) that compares favorably to similar models that are 10x larger. As
part of arriving at this strong performance, we compare Vision Transformer
(ViT) models pretrained using classification objectives to contrastively
(SigLIP) pretrained ones. We find that, while slightly underperforming on
standard image classification benchmarks, SigLIP-based PaLI shows superior
performance across various multimodal benchmarks, especially on localization
and visually-situated text understanding. We scale the SigLIP image encoder up
to 2 billion parameters, and achieve a new state-of-the-art on multilingual
cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles
research on fundamental pieces of complex VLMs, and could fuel a new generation
of scaled-up models.
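
As background for the comparison above between classification-pretrained ViT encoders and contrastively (SigLIP) pretrained ones, the following is a minimal JAX sketch of the pairwise sigmoid loss that SigLIP-style pretraining optimizes over a batch of image-text embedding pairs. It is an illustration only, assuming L2-normalized embeddings and learnable temperature/bias scalars; the function name and shapes are our assumptions, not code from the paper.

import jax
import jax.numpy as jnp

def sigmoid_contrastive_loss(img_emb, txt_emb, temperature, bias):
    # img_emb, txt_emb: [batch, dim], assumed L2-normalized (illustrative shapes).
    logits = temperature * img_emb @ txt_emb.T + bias  # [batch, batch] pair scores
    # Target is +1 on the diagonal (matching image-text pairs), -1 elsewhere.
    labels = 2.0 * jnp.eye(logits.shape[0]) - 1.0
    # Independent sigmoid cross-entropy over every pair, averaged per example.
    return -jnp.mean(jnp.sum(jax.nn.log_sigmoid(labels * logits), axis=-1))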