PaLI-X: On Scaling up a Multilingual Vision and Language Model
May 29, 2023
Authors: Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
cs.AI
Abstract
We present the training recipe and results of scaling up PaLI-X, a
multilingual vision and language model, both in terms of size of the components
and the breadth of its training task mixture. Our model achieves new levels of
performance on a wide range of varied and complex tasks, including multiple
image-based captioning and question-answering tasks, image-based document
understanding and few-shot (in-context) learning, as well as object detection,
video question answering, and video captioning. PaLI-X advances the
state-of-the-art on most vision-and-language benchmarks considered (25+ of
them). Finally, we observe emerging capabilities, such as complex counting and
multilingual object detection, tasks that are not explicitly in the training
mix.