PaLI-X: On Scaling up a Multilingual Vision and Language Model

Abstract

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state of the art on most vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, on tasks that are not explicitly in the training mix.
