PaLI-X: On Scaling up a Multilingual Vision and Language Model

Abstract

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state of the art on most of the vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, which are tasks not explicitly included in the training mix.

