PaLI-3 Vision Language Models: Smaller, Faster, Stronger

Abstract

This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks, especially on localization and visually-situated text understanding. We scale the SigLIP image encoder up to 2 billion parameters and achieve a new state-of-the-art on multilingual cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles research on fundamental pieces of complex VLMs, and could fuel a new generation of scaled-up models.
