
Jointly Reinforcing Diversity and Quality in Language Model Generations

September 2, 2025
Authors: Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, Tianlu Wang
cs.AI

Abstract

Post-training of Large Language Models (LMs) often prioritizes accuracy and helpfulness at the expense of diversity. This creates a tension: while post-training improves response quality, it also sharpens output distributions and reduces the range of ideas, limiting the usefulness of LMs in creative and exploratory tasks such as brainstorming, storytelling, or problem solving. We address this challenge with Diversity-Aware Reinforcement Learning (DARLING), a framework that jointly optimizes for response quality and semantic diversity. At its core, DARLING introduces a learned partition function to measure diversity beyond surface-level lexical variations. This diversity signal is then combined with a quality reward during online reinforcement learning, encouraging models to generate outputs that are both high-quality and distinct. Experiments across multiple model families and sizes show that DARLING generalizes to two regimes: non-verifiable tasks (instruction following and creative writing) and verifiable tasks (competition math). On five benchmarks in the first setting, DARLING consistently outperforms quality-only RL baselines, producing outputs that are simultaneously of higher quality and novelty. In the second setting, DARLING achieves higher pass@1 (solution quality) and pass@k (solution variety). Most strikingly, explicitly optimizing for diversity catalyzes exploration in online RL, which manifests itself as higher-quality responses.
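The abstract describes the core mechanism of DARLING: responses are grouped by a learned partition function that captures semantic (rather than lexical) similarity, and the resulting diversity signal is combined with a quality reward during online RL. The sketch below illustrates that reward-shaping idea only; the `semantic_equal` callable, the greedy partitioning, and the multiplicative combination of quality and diversity are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch of diversity-aware reward shaping, assuming a pluggable
# semantic-equivalence predicate and a multiplicative quality/diversity
# combination (both assumptions for illustration).
from typing import Callable, List


def partition_responses(
    responses: List[str],
    semantic_equal: Callable[[str, str], bool],
) -> List[int]:
    """Greedily group responses into semantic equivalence classes.

    Returns, for each response, the index of the partition it falls into.
    """
    representatives: List[str] = []
    labels: List[int] = []
    for resp in responses:
        for idx, rep in enumerate(representatives):
            if semantic_equal(resp, rep):
                labels.append(idx)
                break
        else:
            representatives.append(resp)
            labels.append(len(representatives) - 1)
    return labels


def shaped_rewards(
    responses: List[str],
    quality_rewards: List[float],
    semantic_equal: Callable[[str, str], bool],
) -> List[float]:
    """Combine a per-response quality reward with a diversity bonus.

    A response in a small partition (few semantically similar siblings in the
    sampled batch) receives a larger diversity factor than one in a crowded
    partition, so the policy is rewarded for being both good and distinct.
    """
    labels = partition_responses(responses, semantic_equal)
    n = len(responses)
    rewards: List[float] = []
    for quality, label in zip(quality_rewards, labels):
        partition_size = labels.count(label)
        diversity = 1.0 - (partition_size - 1) / max(n - 1, 1)
        rewards.append(quality * (1.0 + diversity))  # illustrative combination
    return rewards
```

In an online RL loop, shaped rewards like these would stand in for the raw quality rewards when computing advantages for the sampled responses, so that semantically redundant generations are down-weighted even when they score well on quality alone.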