LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Abstract

Long-context capability is critical for multi-modal foundation models. We introduce LongVILA, a full-stack solution for long-context vision-language models that spans system, model training, and dataset development. On the system side, we introduce the first Multi-Modal Sequence Parallelism (MM-SP) system that enables long-context training and inference, supporting training at a 2M context length on 256 GPUs. MM-SP is also efficient: it is 2.1x to 5.7x faster than Ring-Style Sequence Parallelism and 1.1x to 1.4x faster than Megatron-LM in text-only settings. Moreover, it integrates seamlessly with Hugging Face Transformers. For model training, we propose a five-stage pipeline comprising alignment, pre-training, context extension, and long-short joint supervised fine-tuning. Regarding datasets, we meticulously construct large-scale visual language pre-training datasets and long video instruction-following datasets to support our multi-stage training process. The full-stack solution extends the feasible frame count of VILA by a factor of 128 (from 8 to 1024 frames), improves the long video captioning score from 2.00 to 3.26 (1.6x), and achieves 99.5% accuracy on a needle-in-a-haystack task over a 1400-frame video (274k context length). LongVILA-8B also shows consistent performance improvement on long videos in the VideoMME benchmark as the number of video frames increases.
