LongDPO: Unlock Better Long-form Generation Abilities for LLMs via Critique-augmented Stepwise Information

Abstract

Long-form generation is crucial for academic paper writing and repository-level code generation. Despite this, current models, including GPT-4o, still exhibit unsatisfactory performance. Existing methods that apply preference learning with outcome supervision often fail to provide detailed feedback for extended contexts. This shortcoming can lead to content that does not fully satisfy query requirements, resulting in issues such as length deviations and diminished quality. In this paper, we propose enhancing long-form generation by incorporating process supervision. We employ Monte Carlo Tree Search to gather stepwise preference pairs, using a global memory pool to maintain consistency. To address suboptimal candidate selection, we integrate external critiques to refine the preference pairs and improve their quality. Finally, we apply step-level DPO to the collected stepwise preference pairs. Experimental results show that our method improves length and quality on long-form generation benchmarks, with almost lossless performance on general benchmarks across various model backbones.
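To make the final training step concrete, the following is a minimal sketch of the standard DPO objective applied to a single stepwise preference pair. It is an illustration under stated assumptions, not the paper's implementation: the `StepPair` container, its field names, and the default `beta` value are hypothetical, and each log-probability is assumed to be the summed token log-probability of one candidate step conditioned on the query and all previously accepted steps.

```python
import math
from dataclasses import dataclass


@dataclass
class StepPair:
    """One stepwise preference pair (hypothetical container, not from the paper).

    Each field is assumed to be the summed token log-probability of a
    candidate step, conditioned on the query plus all earlier steps.
    """
    policy_chosen: float    # log p_policy(preferred step | context)
    policy_rejected: float  # log p_policy(dispreferred step | context)
    ref_chosen: float       # log p_ref(preferred step | context)
    ref_rejected: float     # log p_ref(dispreferred step | context)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def step_dpo_loss(pair: StepPair, beta: float = 0.1) -> float:
    """Standard DPO loss evaluated on a single step-level pair.

    beta is an assumed default; the abstract does not state a value.
    """
    # How much more the policy likes each step than the reference does.
    chosen_margin = pair.policy_chosen - pair.ref_chosen
    rejected_margin = pair.policy_rejected - pair.ref_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)
```

In this sketch the loss equals log 2 when the policy and the reference agree on both steps, and it falls below that value as the policy assigns relatively more probability to the preferred step. Averaging the loss over all collected pairs would give a batch objective.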

