Scaling Vision Pre-Training to 4K Resolution

Abstract

Perceiving visual details at high resolution is crucial for daily tasks. Current vision pre-training, however, is still limited to low resolutions (e.g., 378 x 378 pixels) because of the quadratic cost of processing larger images. We introduce PS3, which scales CLIP-style vision pre-training to 4K resolution at near-constant cost. Instead of contrastive learning on a global image representation, PS3 is pre-trained by selectively processing local regions and contrasting them with detailed local captions, which enables high-resolution representation learning with greatly reduced computational overhead. The pre-trained PS3 can both encode the global image at low resolution and selectively process local high-resolution regions based on their saliency or their relevance to a text prompt. When PS3 is applied to a multi-modal LLM (MLLM), the resulting model, named VILA-HD, significantly improves high-resolution visual perception over baselines without high-resolution vision pre-training, such as AnyRes and S^2, while using up to 4.3x fewer tokens. PS3 also unlocks appealing scaling properties of VILA-HD, including scaling up resolution for free and scaling up test-time compute for better performance. Compared with the state of the art, VILA-HD outperforms previous MLLMs such as NVILA and Qwen2-VL across multiple benchmarks and achieves better efficiency than the latest token pruning approaches. Finally, we find that current benchmarks do not require 4K-resolution perception, which motivates us to propose 4KPro, a new benchmark of image QA at 4K resolution. On 4KPro, VILA-HD outperforms all previous MLLMs, including a 14.5% improvement over GPT-4o, and a 3.2% improvement and 2.96x speedup over Qwen2-VL.
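
To make the cost argument concrete, the following is a minimal arithmetic sketch, not PS3's implementation. It assumes a ViT-style encoder with 14-pixel patches and uses a 3780-pixel side as a stand-in for 4K; neither value is stated in the abstract. The helper `select_top_k` is hypothetical and only illustrates how processing a fixed budget of the highest-scoring regions keeps the number of processed regions independent of image size.

```python
from typing import List, Sequence

# Illustrative arithmetic only: the patch size and the 4K side length are
# assumptions, not values taken from the abstract.

def num_patch_tokens(side_px: int, patch_px: int = 14) -> int:
    """Number of non-overlapping square patches in a square image."""
    per_side = side_px // patch_px
    return per_side * per_side

def attention_pairs(tokens: int) -> int:
    """Token pairs scored by full self-attention (quadratic in token count)."""
    return tokens * tokens

def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring regions, in descending score order.

    Hypothetical helper: the scores stand in for saliency or prompt relevance.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

low = num_patch_tokens(378)    # 27 patches per side
high = num_patch_tokens(3780)  # 270 patches per side (assumed 4K-scale image)

if __name__ == "__main__":
    print("tokens at 378 px: ", low)
    print("tokens at 3780 px:", high)
    print("token growth:     ", high // low)
    print("attention growth: ", attention_pairs(high) // attention_pairs(low))
    print("regions kept:     ", select_top_k([0.1, 0.9, 0.4, 0.7], k=2))
```

Under these assumptions, a 10x longer image side gives 100x more tokens, and full self-attention over them scores 10,000x more token pairs. Processing only a fixed number of selected regions avoids that growth.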
