Let ViT Speak: Generative Language-Image Pre-training

Abstract

In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.
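As a rough illustration of the objective described in the abstract, the sketch below computes a next-token language modeling loss over a single sequence of visual tokens followed by text tokens. It is not the authors' implementation. The prefix-style attention mask (bidirectional among visual tokens, causal over text), the restriction of the loss to text positions, and the helper names `prefix_lm_mask` and `text_lm_loss` are all assumptions made for illustration; the abstract does not specify these details.

```python
import math


def prefix_lm_mask(num_visual, num_text):
    """Return mask[i][j] = True when position i may attend to position j.

    Assumption (not stated in the abstract): visual tokens attend to each
    other bidirectionally, while text tokens attend to every visual token
    and only to text tokens at or before their own position.
    """
    n = num_visual + num_text
    return [
        [(j <= i) or (i < num_visual and j < num_visual) for j in range(n)]
        for i in range(n)
    ]


def text_lm_loss(logits, text_ids, num_visual):
    """Mean next-token cross-entropy over the text tokens only.

    logits: one row of vocabulary scores per position of the sequence
            [visual tokens; text tokens], as produced by one transformer.
    text_ids: target ids of the text tokens.
    The output at position num_visual - 1 + k is scored against text token k,
    so the last visual token predicts the first text token.
    """
    total = 0.0
    for k, target in enumerate(text_ids):
        row = logits[num_visual - 1 + k]
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total / len(text_ids)
```

With uniform logits over a vocabulary of size V, this loss equals log V, which makes a convenient sanity check. Note that no second image-text pair enters the computation, in line with the abstract's point that no contrastive batch construction or separate text decoder is required.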
