PaLI-X: On Scaling up a Multilingual Vision and Language Model

Abstract

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding, and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state of the art on most of the vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, on tasks that are not explicitly in the training mix.
