
LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs

August 13, 2024
作者: Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li
cs.AI

Abstract

Current long context large language models (LLMs) can process inputs up to 100,000 tokens, yet struggle to generate outputs exceeding even a modest length of 2,000 words. Through controlled experiments, we find that a model's effective generation length is inherently bounded by the samples it has seen during supervised fine-tuning (SFT). In other words, its output limitation is due to the scarcity of long-output examples in existing SFT datasets. To address this, we introduce AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words. Leveraging AgentWrite, we construct LongWriter-6k, a dataset containing 6,000 SFT samples with output lengths ranging from 2k to 32k words. By incorporating this dataset into model training, we successfully scale the output length of existing models to over 10,000 words while maintaining output quality. We also develop LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities. Our 9B parameter model, further improved through DPO, achieves state-of-the-art performance on this benchmark, surpassing even much larger proprietary models. In general, our work demonstrates that existing long context LLMs already possess the potential for a larger output window--all you need is data with extended output during model alignment to unlock this capability. Our code & models are at: https://github.com/THUDM/LongWriter.
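The abstract describes AgentWrite only at a high level: plan an outline, then generate each section in turn while conditioning on what has already been written. A minimal sketch of such a plan-then-write pipeline is below; `call_llm` is a hypothetical stand-in for any chat-completion client, and the prompts are illustrative rather than the authors' exact ones.

```python
# Sketch of an AgentWrite-style plan-then-write pipeline (illustrative only).
# `call_llm` is a hypothetical helper, not part of the LongWriter codebase.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an off-the-shelf LLM and return its reply."""
    raise NotImplementedError("plug in your model/API client here")

def agent_write(instruction: str) -> str:
    # Step 1 (plan): ask the model to split the writing task into numbered
    # subtasks, one per line, each with a brief description and word budget.
    plan = call_llm(
        "Break the following writing task into numbered sections, one per "
        f"line, each with a short description and a target word count:\n{instruction}"
    )
    sections = [line for line in plan.splitlines() if line.strip()]

    # Step 2 (write): generate sections sequentially, conditioning each call
    # on the instruction, the full plan, and the text produced so far,
    # so the concatenated output stays coherent.
    written = []
    for i, section in enumerate(sections, 1):
        text = call_llm(
            f"Task: {instruction}\nPlan:\n{plan}\n"
            f"Already written:\n{''.join(written)}\n"
            f"Now write section {i} only: {section}"
        )
        written.append(text + "\n\n")

    return "".join(written)
```

Chaining subtask outputs this way is what lets an off-the-shelf model exceed its usual single-pass output length: each call stays well within the model's comfortable generation budget, while the running context keeps the sections consistent.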
