LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Abstract

Long-context capability is critical for multi-modal foundation models. We introduce LongVILA, a full-stack solution for long-context vision-language models that spans system design, model training, and dataset development. On the system side, we introduce the first Multi-Modal Sequence Parallelism (MM-SP) system for long-context training and inference, which enables training with a 2M context length on 256 GPUs. MM-SP is also efficient: it is 2.1x to 5.7x faster than Ring-Style Sequence Parallelism and 1.1x to 1.4x faster than Megatron-LM in text-only settings. Moreover, it integrates seamlessly with Hugging Face Transformers. For model training, we propose a five-stage pipeline comprising alignment, pre-training, context extension, and long-short joint supervised fine-tuning. Regarding datasets, we meticulously construct large-scale visual language pre-training datasets and long video instruction-following datasets to support our multi-stage training process. The full-stack solution extends the feasible number of frames for VILA by a factor of 128 (from 8 to 1024 frames) and improves the long video captioning score from 2.00 to 3.26 (1.6x). It also achieves 99.5% accuracy on a needle-in-a-haystack task over a 1400-frame video (274k context length). LongVILA-8B further shows consistent performance improvements on long videos in the VideoMME benchmark as the number of video frames increases.
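
The headline ratios quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only a sanity check on the reported numbers, not part of LongVILA itself. The variable names are illustrative, "274k" is read as 274,000 tokens, and the tokens-per-frame figure assumes the context is split evenly across frames and ignores any text tokens, so it should be taken as a rough estimate.

```python
# Sanity check of the ratios reported in the abstract.
# All variable names are illustrative; none come from the LongVILA codebase.

frames_before, frames_after = 8, 1024          # feasible frames: VILA vs. LongVILA
caption_before, caption_after = 2.00, 3.26     # long video captioning score
needle_frames, needle_tokens = 1400, 274_000   # assumption: "274k" means 274,000 tokens

# Scale-up in the number of frames the model can process.
frame_scale = frames_after / frames_before

# Relative improvement in captioning score.
caption_gain = caption_after / caption_before

# Rough visual tokens per frame, assuming an even split and no text tokens.
tokens_per_frame = needle_tokens / needle_frames

print(f"frame scale-up:   {frame_scale:.0f}x")
print(f"captioning gain:  {caption_gain:.2f}x")
print(f"tokens per frame: {tokens_per_frame:.1f} (approximate)")
```

The frame scale-up comes out to exactly 128, and the captioning gain rounds to the 1.6x stated in the abstract. The tokens-per-frame value is an estimate derived from the abstract's numbers rather than a figure the abstract states.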
