LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
June 23, 2025
Authors: Yuhao Wu, Yushi Bai, Zhiqiang Hu, Roy Ka-Wei Lee, Juanzi Li
cs.AI
Abstract
Ultra-long text generation by large language models (LLMs) is a widely demanded capability, yet it remains a significant challenge due to models' maximum generation length limits and the overall quality degradation that accompanies increasing sequence length. Previous approaches, exemplified by LongWriter, typically rely on "teaching", i.e., supervised fine-tuning (SFT) on synthetic long-form outputs. However, this strategy depends heavily on synthetic SFT data, which is difficult and costly to construct, often lacks coherence and consistency, and tends to be overly artificial and structurally monotonous. In this work, we propose an incentivization-based approach that starts entirely from scratch, without relying on any annotated or synthetic data, and leverages reinforcement learning (RL) to foster the emergence of ultra-long, high-quality text generation capabilities in LLMs. Similar to R1-Zero, we perform RL training from a base model, guiding it to engage in reasoning that facilitates planning and refinement during the writing process. To support this, we employ specialized reward models that steer the LLM toward improved length control, writing quality, and structural formatting. Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and Arena-Write, and even surpassing 100B+ models such as DeepSeek R1 and Qwen3-235B. We open-source our data and model checkpoints at https://huggingface.co/THU-KEG/LongWriter-Zero-32B.
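The abstract names three reward signals that jointly steer RL training: length control, writing quality, and structural formatting. As a rough illustration of how such signals might be combined into one scalar for a policy-gradient update, here is a minimal Python sketch; the heuristics, function names, and weights are assumptions for exposition, not the authors' released implementation, which uses learned reward models.

```python
# Illustrative sketch only: the paper trains learned reward models, whereas the
# heuristics, names, and weights below are hypothetical stand-ins that show how
# length, quality, and format signals could be combined into a single reward.
import re

def length_reward(num_words: int, target_words: int) -> float:
    # Peaks at the requested length, decays linearly with relative deviation.
    deviation = abs(num_words - target_words) / max(target_words, 1)
    return max(0.0, 1.0 - deviation)

def format_reward(text: str) -> float:
    # Crude structural proxy: paragraph breaks plus markdown-style or numbered
    # headings suggest an organized long-form draft.
    has_paragraphs = text.count("\n\n") >= 3
    has_headings = bool(re.search(r"^(#+ |\d+\.)", text, flags=re.MULTILINE))
    return 0.5 * has_paragraphs + 0.5 * has_headings

def composite_reward(text: str, target_words: int, quality_score: float) -> float:
    # quality_score in [0, 1] stands in for a learned writing-quality reward
    # model; the 0.4/0.4/0.2 weights are arbitrary choices for this sketch.
    num_words = len(text.split())
    return (0.4 * length_reward(num_words, target_words)
            + 0.4 * quality_score
            + 0.2 * format_reward(text))
```

In a full RL pipeline, a function like composite_reward would score each sampled completion, and the resulting scalar would drive a policy-gradient update (e.g., PPO-style); the paper's learned reward models take the place of the hand-written heuristics above.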