LongDPO: Unlock Better Long-form Generation Abilities for LLMs via Critique-augmented Stepwise Information
February 4, 2025
Authors: Bowen Ping, Jiali Zeng, Fandong Meng, Shuo Wang, Jie Zhou, Shanghang Zhang
cs.AI
Abstract
Long-form generation is crucial for writing academic papers and repo-level
code generation. Despite this, current models, including GPT-4o, still exhibit
unsatisfactory performance. Existing methods that utilize preference learning
with outcome supervision often fail to provide detailed feedback for extended
contexts. This shortcoming can lead to content that does not fully satisfy
query requirements, resulting in issues such as length deviations and diminished
quality. In this paper, we propose enhancing long-form generation by
incorporating process supervision. We employ Monte Carlo Tree Search to gather
stepwise preference pairs, utilizing a global memory pool to maintain
consistency. To address the issue of suboptimal candidate selection, we
integrate external critiques to refine and improve the quality of the
preference pairs. Finally, we apply step-level DPO using the collected stepwise
preference pairs. Experimental results show that our method improves length and
quality on long-form generation benchmarks, with almost lossless performance on
general benchmarks across various model backbones.
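The data-collection pipeline described above (MCTS rollouts, the global memory pool, and external critiques) is not reproduced here; the sketch below only illustrates the final training stage, the standard DPO objective applied to a batch of collected stepwise preference pairs. This is a minimal sketch assuming summed per-step log-probabilities are precomputed for both the policy and a frozen reference model; the function and argument names are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def step_dpo_loss(policy_chosen_logps: torch.Tensor,
                  policy_rejected_logps: torch.Tensor,
                  ref_chosen_logps: torch.Tensor,
                  ref_rejected_logps: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """DPO objective over stepwise preference pairs.

    Each tensor entry is the summed log-probability of one preferred
    (chosen) or dispreferred (rejected) generation step, scored by the
    policy being trained and by the frozen reference model.
    """
    # Implicit rewards are the policy/reference log-probability ratios.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(chosen_reward - rejected_reward), averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

At step level the only change from vanilla DPO is the unit over which log-probabilities are summed: a single MCTS-selected step rather than the full response.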