LongVILA: Scaling Long-Context Visual Language Models for Long Videos
August 19, 2024
Authors: Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han
cs.AI
Abstract
Long-context capability is critical for multi-modal foundation models. We
introduce LongVILA, a full-stack solution for long-context vision-language
models, including system, model training, and dataset development. On the
system side, we introduce the first Multi-Modal Sequence Parallelism (MM-SP)
system, which enables long-context training and inference and supports training
with a 2M context length on 256 GPUs. MM-SP is also efficient, running 2.1x - 5.7x
faster than Ring-Style Sequence Parallelism and 1.1x - 1.4x faster than Megatron-LM in
text-only settings. Moreover, it seamlessly integrates with Hugging Face
Transformers. For model training, we propose a five-stage pipeline comprising
alignment, pre-training, context extension, and long-short joint supervised
fine-tuning. Regarding datasets, we meticulously construct large-scale visual
language pre-training datasets and long video instruction-following datasets to
support our multi-stage training process. The full-stack solution extends the
feasible frame number of VILA by a factor of 128 (from 8 to 1024 frames) and
improves the long video captioning score from 2.00 to 3.26 (1.6x), achieving 99.5%
accuracy on a 1400-frame video (274k context length) needle-in-a-haystack evaluation.
LongVILA-8B also demonstrates consistent performance improvement on long videos
in the VideoMME benchmark as the number of video frames increases.
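The abstract's 2M-context / 256-GPU figure rests on the core idea of sequence parallelism: one very long multimodal token sequence is split across GPU ranks, each rank holding a contiguous slice, with attention over the full sequence then requiring key/value exchange between ranks. The sketch below is not the LongVILA MM-SP implementation; it is a minimal illustration of the sharding step only, with a hypothetical `shard_sequence` helper.

```python
# Minimal sketch (not the LongVILA MM-SP implementation): how sequence
# parallelism assigns each GPU rank a contiguous slice of one long
# multimodal token sequence. Full-sequence attention then requires
# exchanging keys/values across ranks, which is where systems like
# MM-SP focus their communication optimizations.

def shard_sequence(num_tokens: int, world_size: int):
    """Return a (start, end) token range per GPU rank, balanced to within one token."""
    base, rem = divmod(num_tokens, world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < rem else 0)  # spread the remainder over low ranks
        shards.append((start, start + size))
        start += size
    return shards

# The 2M-context / 256-GPU setting from the abstract: each rank holds
# roughly 8k tokens of the full sequence.
shards = shard_sequence(2_000_000, 256)
print(len(shards), shards[0][1] - shards[0][0])  # → 256 7813
```

In a real system the slices would also need to respect image/video token boundaries so that a frame's tokens are not split awkwardly across ranks; the balanced split above is the simplest possible policy.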