SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models

Abstract

Long-form text generation remains a significant challenge for large language models (LLMs): coherence, logical consistency, and overall text quality all become harder to maintain as sequence length increases. To address these limitations, we propose SuperWriter-Agent, an agent-based framework designed to improve the quality and consistency of long-form text generation. SuperWriter-Agent introduces explicit structured thinking into the generation pipeline through planning and refinement stages, guiding the model to follow a more deliberate and cognitively grounded process akin to that of a professional writer. Based on this framework, we construct a supervised fine-tuning dataset to train a 7B-parameter SuperWriter-LM. We further develop a hierarchical Direct Preference Optimization (DPO) procedure that uses Monte Carlo Tree Search (MCTS) to propagate final quality assessments and optimize each generation step accordingly. Empirical results across diverse benchmarks show that SuperWriter-LM achieves state-of-the-art performance, surpassing even larger-scale baseline models in both automatic and human evaluation. Furthermore, comprehensive ablation studies demonstrate the effectiveness of hierarchical DPO and underscore the value of incorporating structured thinking steps for improving the quality of long-form text generation.
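
To make the idea of propagating final quality assessments back to earlier generation steps more concrete, the following is a minimal sketch and not the procedure used in this work. It assumes a small tree of candidates in which only the leaves carry a final quality score, averages those scores upward (only the backpropagation step of MCTS; selection and expansion are omitted), and then forms one (context, chosen, rejected) pair per level from the best and worst siblings. The names `Node`, `backpropagate`, `preference_pairs`, `example_tree`, the stage labels, and the scores are all hypothetical. The `dpo_loss` function is the standard DPO objective for a single pair, with `beta=0.1` chosen arbitrarily.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One candidate output at some stage (stage names here are hypothetical)."""
    text: str
    score: Optional[float] = None  # leaves: final quality score from a judge
    children: List["Node"] = field(default_factory=list)


def backpropagate(node: Node) -> float:
    """Fill in scores bottom-up: an internal node gets the mean of its children."""
    if node.children:
        node.score = sum(backpropagate(c) for c in node.children) / len(node.children)
    if node.score is None:
        raise ValueError(f"leaf {node.text!r} has no final score")
    return node.score


def preference_pairs(node: Node) -> List[Tuple[str, str, str]]:
    """Collect (context, chosen, rejected) from best/worst siblings at every level."""
    pairs = []
    if len(node.children) >= 2:
        ranked = sorted(node.children, key=lambda c: c.score)
        if ranked[-1].score > ranked[0].score:  # skip ties: no preference signal
            pairs.append((node.text, ranked[-1].text, ranked[0].text))
    for child in node.children:
        pairs.extend(preference_pairs(child))
    return pairs


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, given sequence log-probabilities.

    Computes -log(sigmoid(margin)) as log(1 + exp(-margin)).
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))


def example_tree() -> Node:
    """Toy two-level tree: two plans, each with two scored drafts (made-up scores)."""
    return Node("prompt", children=[
        Node("plan A", children=[Node("draft A1", 8.0), Node("draft A2", 6.0)]),
        Node("plan B", children=[Node("draft B1", 5.0), Node("draft B2", 3.0)]),
    ])


root = example_tree()
backpropagate(root)
for context, chosen, rejected in preference_pairs(root):
    print(f"{context}: prefer {chosen!r} over {rejected!r}")
```

In this toy tree the planning stage receives a preference signal even though only the final drafts were scored, which is the property the hierarchical setup relies on. Averaging is only one possible aggregation; taking the maximum over children would be an equally plausible choice under these assumptions.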