LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning

Abstract

Ultra-long generation by large language models (LLMs) is a widely demanded scenario, yet it remains a significant challenge because of maximum generation length limits and overall quality degradation as sequence length increases. Previous approaches, exemplified by LongWriter, typically rely on "teaching", which involves supervised fine-tuning (SFT) on synthetic long-form outputs. However, this strategy depends heavily on synthetic SFT data, which is difficult and costly to construct, often lacks coherence and consistency, and tends to be overly artificial and structurally monotonous. In this work, we propose an incentivization-based approach that, starting entirely from scratch and without relying on any annotated or synthetic data, leverages reinforcement learning (RL) to foster the emergence of ultra-long, high-quality text generation capabilities in LLMs. We perform RL training starting from a base model, similar to R1-Zero, guiding it to engage in reasoning that facilitates planning and refinement during the writing process. To support this, we employ specialized reward models that steer the LLM towards improved length control, writing quality, and structural formatting. Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and Arena-Write, and even surpassing 100B+ models such as DeepSeek R1 and Qwen3-235B. We open-source our data and model checkpoints at https://huggingface.co/THU-KEG/LongWriter-Zero-32B.
