

World Model on Million-Length Video And Language With RingAttention

February 13, 2024
Authors: Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel
cs.AI

Abstract

Current language models fall short in understanding aspects of the world not easily described in words, and struggle with complex, long-form tasks. Video sequences offer valuable temporal information absent in language and static images, making them attractive for joint modeling with language. Such models could develop an understanding of both human textual knowledge and the physical world, enabling broader AI capabilities for assisting humans. However, learning from millions of tokens of video and language sequences poses challenges due to memory constraints, computational complexity, and limited datasets. To address these challenges, we curate a large dataset of diverse videos and books, utilize the RingAttention technique to scalably train on long sequences, and gradually increase context size from 4K to 1M tokens. This paper makes the following contributions: (a) Largest context size neural network: We train one of the largest context size transformers on long video and language sequences, setting new benchmarks in difficult retrieval tasks and long video understanding. (b) Solutions for overcoming vision-language training challenges, including using masked sequence packing for mixing different sequence lengths, loss weighting to balance language and vision, and a model-generated QA dataset for long sequence chat. (c) A highly-optimized implementation with RingAttention, masked sequence packing, and other key features for training on millions-length multimodal sequences. (d) Fully open-sourced a family of 7B parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens. This work paves the way for training on massive datasets of long video and language to develop understanding of both human knowledge and the multimodal world, and broader capabilities.
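As a rough illustration of the RingAttention idea the abstract relies on (not the paper's actual implementation, which shards across devices): the sequence is split into blocks, each "device" holds one query block, and key/value blocks rotate around a ring while an online (log-sum-exp) softmax accumulation keeps the result exactly equal to full attention. The function name and single-process simulation below are assumptions for illustration only.

```python
import numpy as np

def ring_attention(q_blocks, k_blocks, v_blocks):
    """Single-process simulation of ring attention: device i owns
    q_blocks[i] and sees every k/v block as it circulates the ring,
    accumulating exact softmax attention via online rescaling."""
    n = len(q_blocks)
    outputs = []
    for i in range(n):
        q = q_blocks[i]
        acc = np.zeros_like(q)                    # running sum of weighted values
        row_max = np.full(q.shape[0], -np.inf)    # running row-wise max of scores
        row_sum = np.zeros(q.shape[0])            # running softmax denominator
        for step in range(n):                      # one ring rotation per step
            j = (i + step) % n
            scores = q @ k_blocks[j].T / np.sqrt(q.shape[-1])
            new_max = np.maximum(row_max, scores.max(axis=-1))
            scale = np.exp(row_max - new_max)      # rescale previous partial sums
            p = np.exp(scores - new_max[:, None])
            acc = acc * scale[:, None] + p @ v_blocks[j]
            row_sum = row_sum * scale + p.sum(axis=-1)
            row_max = new_max
        outputs.append(acc / row_sum[:, None])
    return np.concatenate(outputs)
```

Because each device only ever materializes one blockwise score matrix at a time, peak memory scales with the block size rather than the full (here, million-token) sequence length; that is what makes the 4K-to-1M context extension feasible.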
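The abstract's "masked sequence packing" refers to packing several examples of different lengths into one training sequence while masking attention so tokens cannot attend across example boundaries. A minimal sketch of such a block-diagonal mask (the helper name is hypothetical, and the real implementation would combine this with a causal mask):

```python
import numpy as np

def packed_attention_mask(lengths):
    """Block-diagonal attention mask for packed sequences: position p may
    attend to position q only if both fall inside the same packed example."""
    total = sum(lengths)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for length in lengths:
        # allow attention only within this example's span
        mask[start:start + length, start:start + length] = True
        start += length
    return mask
```

Without this masking, a short clip packed next to an unrelated book excerpt would leak attention between the two, which is one of the vision-language training pitfalls the paper's solutions address.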

