PaLI-X: On Scaling up a Multilingual Vision and Language Model

Abstract

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state of the art on most of the vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
