
World Model on Million-Length Video And Language With RingAttention

February 13, 2024
作者: Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel
cs.AI

Abstract

Current language models fall short in understanding aspects of the world not easily described in words, and struggle with complex, long-form tasks. Video sequences offer valuable temporal information absent in language and static images, making them attractive for joint modeling with language. Such models could develop an understanding of both human textual knowledge and the physical world, enabling broader AI capabilities for assisting humans. However, learning from millions of tokens of video and language sequences poses challenges due to memory constraints, computational complexity, and limited datasets. To address these challenges, we curate a large dataset of diverse videos and books, utilize the RingAttention technique to scalably train on long sequences, and gradually increase context size from 4K to 1M tokens. This paper makes the following contributions: (a) Largest context size neural network: We train one of the largest context size transformers on long video and language sequences, setting new benchmarks in difficult retrieval tasks and long video understanding. (b) Solutions for overcoming vision-language training challenges, including using masked sequence packing for mixing different sequence lengths, loss weighting to balance language and vision, and a model-generated QA dataset for long-sequence chat. (c) A highly optimized implementation with RingAttention, masked sequence packing, and other key features for training on million-length multimodal sequences. (d) A fully open-sourced family of 7B parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens. This work paves the way for training on massive datasets of long video and language to develop understanding of both human knowledge and the multimodal world, and broader capabilities.
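The RingAttention idea referenced in the abstract can be sketched as a single-process simulation: each "device" holds one query block, key/value blocks circulate around a ring, and each device folds incoming blocks into running softmax statistics so the full sequence never has to fit in one device's memory. This is only an illustrative NumPy sketch (non-causal, no real communication; the function name is ours, not the paper's JAX implementation):

```python
import numpy as np

def ring_attention(q_blocks, k_blocks, v_blocks):
    """Single-process simulation of the RingAttention communication pattern.

    Each "device" i keeps its query block fixed; at each ring step it
    receives the next key/value block and updates an online-softmax
    accumulator (log-sum-exp trick), so no device ever materializes the
    full attention matrix.
    """
    n = len(q_blocks)
    d = q_blocks[0].shape[-1]
    outputs = []
    for i in range(n):
        q = q_blocks[i]
        m = np.full(q.shape[0], -np.inf)        # running row-wise max
        l = np.zeros(q.shape[0])                # running softmax normalizer
        acc = np.zeros_like(q)                  # running weighted value sum
        for step in range(n):
            j = (i + step) % n                  # block arriving on the ring
            s = q @ k_blocks[j].T / np.sqrt(d)  # scores against this block
            m_new = np.maximum(m, s.max(axis=-1))
            scale = np.exp(m - m_new)           # rescale old accumulators
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=-1)
            acc = acc * scale[:, None] + p @ v_blocks[j]
            m = m_new
        outputs.append(acc / l[:, None])
    return np.concatenate(outputs)
```

Because the accumulator update is exact (it is the standard online softmax), the result matches ordinary full attention; the memory saving comes from only ever holding one key/value block per device at a time.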
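Masked sequence packing, from contribution (b), amounts to building a block-diagonal causal attention mask so that short examples packed into one long training sequence never attend across example boundaries. A minimal NumPy sketch under that reading (the helper name is hypothetical, not from the paper's code):

```python
import numpy as np

def packing_attention_mask(segment_lengths):
    """Block-diagonal causal attention mask for packed sequences.

    A token may attend only to tokens at or before its own position
    that belong to the same packed segment, so examples sharing one
    long sequence stay independent.
    """
    total = sum(segment_lengths)
    # segment id of each token position, e.g. [3, 2] -> [0, 0, 0, 1, 1]
    seg_ids = np.repeat(np.arange(len(segment_lengths)), segment_lengths)
    same_segment = seg_ids[:, None] == seg_ids[None, :]
    causal = np.tril(np.ones((total, total), dtype=bool))
    return same_segment & causal

# Two examples of lengths 3 and 2 packed into one sequence of 5 tokens:
mask = packing_attention_mask([3, 2])
```

With this mask, the first token of the second segment (position 3) cannot see any of positions 0–2, even though they precede it in the packed sequence.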
