Let ViT Speak: Generative Language-Image Pre-training

Abstract

We present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.
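As a rough illustration of the single-transformer objective described in the abstract, the sketch below shows one way a joint sequence of visual and text tokens could be masked and scored with a next-token loss on the text positions only. It is not the authors' implementation. The abstract does not specify the attention pattern, so the prefix-style mask (bidirectional over image patches, causal over text), the choice to predict the first text token from the last visual position, and the names `build_attention_mask` and `text_only_lm_loss` are all assumptions made for illustration.

```python
import math


def build_attention_mask(num_visual, num_text, bidirectional_visual=True):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed sequence layout: [visual tokens..., text tokens...].
    """
    n = num_visual + num_text
    # Start from a plain causal mask over the whole sequence.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    if bidirectional_visual:
        # Assumption: image patches attend to each other in both directions,
        # while text tokens stay causal (prefix-LM style).
        for i in range(num_visual):
            for j in range(num_visual):
                mask[i][j] = True
    return mask


def text_only_lm_loss(logits, text_token_ids, num_visual):
    """Mean next-token cross-entropy computed over text targets only.

    logits[i] is the vocabulary score vector emitted at sequence position i.
    Assumption: text token k is predicted from position num_visual - 1 + k,
    so the first text token is predicted from the last visual position.
    """
    total = 0.0
    for k, target in enumerate(text_token_ids):
        row = logits[num_visual - 1 + k]
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
    return total / len(text_token_ids)


# Toy usage: 2 visual tokens, 3 text tokens, vocabulary of 4, uniform logits.
toy_mask = build_attention_mask(num_visual=2, num_text=3)
toy_logits = [[0.0] * 4 for _ in range(5)]
toy_loss = text_only_lm_loss(toy_logits, [1, 0, 3], num_visual=2)
```

In this toy setup no loss is placed on the visual positions, and no contrastive pairing across the batch is needed; the only supervision comes from predicting the caption tokens, which is the property the abstract emphasizes.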
