SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models
June 4, 2025
Authors: Yuhao Wu, Yushi Bai, Zhiqiang Hu, Juanzi Li, Roy Ka-Wei Lee
cs.AI
Abstract
Long-form text generation remains a significant challenge for large language
models (LLMs), particularly in maintaining coherence, ensuring logical
consistency, and preserving text quality as sequence length increases. To
address these limitations, we propose SuperWriter-Agent, an agent-based
framework designed to enhance the quality and consistency of long-form text
generation. SuperWriter-Agent introduces explicit structured thinking through
planning and refinement stages into the generation pipeline, guiding the model
to follow a more deliberate and cognitively grounded process akin to that of a
professional writer. Based on this framework, we construct a supervised
fine-tuning dataset to train a 7B-parameter SuperWriter-LM. We further develop a
hierarchical Direct Preference Optimization (DPO) procedure that uses Monte
Carlo Tree Search (MCTS) to propagate final quality assessments and optimize
each generation step accordingly. Empirical results across diverse benchmarks
demonstrate that SuperWriter-LM achieves state-of-the-art performance,
surpassing even larger-scale baseline models in both automatic evaluation and
human evaluation. Furthermore, comprehensive ablation studies confirm the
effectiveness of hierarchical DPO and underscore the value of incorporating
structured thinking steps to improve the quality of long-form text generation.
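
The abstract describes the hierarchical DPO procedure only at a high level: an MCTS-style backup of the final quality assessment to the intermediate generation steps, which then supply step-level preference pairs. Below is a minimal, hypothetical Python sketch of that general idea; the names (StepNode, backpropagate, make_dpo_pairs) and the specific crediting rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (hypothetical): propagate a final quality score back through
# the intermediate steps of one generation trajectory, then turn sibling steps
# into (chosen, rejected) pairs for step-level DPO training.
from dataclasses import dataclass, field


@dataclass
class StepNode:
    """One planning / writing / refinement step in a generation trajectory."""
    text: str
    children: list["StepNode"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0  # running mean of propagated final-quality scores


def backpropagate(path: list[StepNode], final_score: float) -> None:
    """MCTS-style backup: credit every step on the path with the final score."""
    for node in path:
        node.visits += 1
        node.value += (final_score - node.value) / node.visits


def make_dpo_pairs(root: StepNode) -> list[tuple[str, str]]:
    """Collect (chosen, rejected) sibling pairs at every depth of the tree."""
    pairs: list[tuple[str, str]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        kids = sorted(node.children, key=lambda c: c.value, reverse=True)
        if len(kids) >= 2 and kids[0].value > kids[-1].value:
            pairs.append((kids[0].text, kids[-1].text))
        stack.extend(node.children)
    return pairs


# Toy usage: two alternative plans for the same prompt; the plan whose rollout
# received the higher final score becomes the "chosen" side of the pair.
root = StepNode("prompt")
plan_a, plan_b = StepNode("plan A"), StepNode("plan B")
root.children = [plan_a, plan_b]
backpropagate([root, plan_a], final_score=0.9)
backpropagate([root, plan_b], final_score=0.4)
print(make_dpo_pairs(root))  # [('plan A', 'plan B')]
```

The sketch assumes each intermediate step simply inherits the mean of the final scores of trajectories passing through it; the paper's actual propagation and pairing criteria may differ.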