LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs
August 13, 2024
Authors: Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li
cs.AI
Abstract
Current long context large language models (LLMs) can process inputs up to
100,000 tokens, yet struggle to generate outputs exceeding even a modest length
of 2,000 words. Through controlled experiments, we find that a model's
effective generation length is inherently bounded by the samples it has seen
during supervised fine-tuning (SFT). In other words, this output limitation
stems from the scarcity of long-output examples in existing SFT datasets. To
address this, we introduce AgentWrite, an agent-based pipeline that decomposes
ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to
generate coherent outputs exceeding 20,000 words. Leveraging AgentWrite, we
construct LongWriter-6k, a dataset containing 6,000 SFT samples with output
lengths ranging from 2k to 32k words. By incorporating this dataset into model
training, we successfully scale the output length of existing models to over
10,000 words while maintaining output quality. We also develop LongBench-Write,
a comprehensive benchmark for evaluating ultra-long generation capabilities.
Our 9B parameter model, further improved through DPO, achieves state-of-the-art
performance on this benchmark, surpassing even much larger proprietary models.
Overall, our work demonstrates that existing long-context LLMs already
possess the potential for a larger output window; all that is needed is data
with extended outputs during model alignment to unlock this capability. Our
code and models are at: https://github.com/THUDM/LongWriter.
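The abstract describes AgentWrite only at a high level: plan first, then write subtask by subtask. As a rough illustration of that idea, here is a minimal plan-then-write sketch; `call_llm`, the prompt wording, and the outline format are assumptions made for illustration, not the paper's actual pipeline (see the linked repository for the real implementation).

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in: replace with a real chat-completion call."""
    raise NotImplementedError

def agentwrite(instruction: str) -> str:
    # Stage I (plan): ask the model for a numbered outline, one line per
    # paragraph, each with a target word count.
    plan = call_llm(
        "Break this writing task into a numbered outline, one paragraph "
        f"per line, each with a target word count:\n{instruction}"
    )
    steps = [line.strip() for line in plan.splitlines() if line.strip()]

    # Stage II (write): generate each paragraph in turn, conditioning on
    # the instruction, the full plan, and everything written so far, so
    # the concatenated output stays coherent across subtasks.
    written = []
    for step in steps:
        text_so_far = "\n\n".join(written)
        paragraph = call_llm(
            f"Task: {instruction}\n"
            f"Plan:\n{plan}\n"
            f"Text so far:\n{text_so_far}\n"
            f"Write only the next paragraph: {step}"
        )
        written.append(paragraph.strip())
    return "\n\n".join(written)
```

The point of the decomposition is that each call only has to produce one plan item at a time, which is how the paper reports getting coherent 20,000+ word outputs from off-the-shelf models.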