

Jointly Reinforcing Diversity and Quality in Language Model Generations

September 2, 2025
Authors: Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, Tianlu Wang
cs.AI

Abstract

Post-training of Large Language Models (LMs) often prioritizes accuracy and helpfulness at the expense of diversity. This creates a tension: while post-training improves response quality, it also sharpens output distributions and reduces the range of ideas, limiting the usefulness of LMs in creative and exploratory tasks such as brainstorming, storytelling, or problem solving. We address this challenge with Diversity-Aware Reinforcement Learning (DARLING), a framework that jointly optimizes for response quality and semantic diversity. At its core, DARLING introduces a learned partition function to measure diversity beyond surface-level lexical variations. This diversity signal is then combined with a quality reward during online reinforcement learning, encouraging models to generate outputs that are both high-quality and distinct. Experiments across multiple model families and sizes show that DARLING generalizes to two regimes: non-verifiable tasks (instruction following and creative writing) and verifiable tasks (competition math). On five benchmarks in the first setting, DARLING consistently outperforms quality-only RL baselines, producing outputs that are simultaneously of higher quality and novelty. In the second setting, DARLING achieves higher pass@1 (solution quality) and pass@k (solution variety). Most strikingly, explicitly optimizing for diversity catalyzes exploration in online RL, which manifests itself as higher-quality responses.
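The abstract does not spell out how the diversity signal and the quality reward are combined, nor how the learned partition function groups responses. The sketch below is only an illustration of the general idea: it uses a hypothetical token-overlap partitioner in place of the paper's learned partition function, and assumes a multiplicative combination of quality and a cluster-size-based diversity bonus. Function names and the placeholder quality model are invented for this example.

```python
# Minimal sketch of the idea in the abstract: when scoring a batch of sampled
# responses during online RL, scale each response's quality reward by a
# diversity bonus so that distinctive responses are reinforced more strongly.
# The partitioner below is a simple Jaccard-overlap stand-in, NOT the paper's
# learned partition function, and the multiplicative combination is assumed.

from typing import Callable, List


def partition_by_overlap(responses: List[str], threshold: float = 0.5) -> List[int]:
    """Greedy clustering: two responses share a cluster when their token-level
    Jaccard overlap exceeds `threshold`. Returns one cluster id per response."""
    cluster_reps: List[set] = []
    labels: List[int] = []
    for resp in responses:
        tokens = set(resp.lower().split())
        for cid, rep in enumerate(cluster_reps):
            if len(tokens & rep) / max(len(tokens | rep), 1) >= threshold:
                labels.append(cid)
                break
        else:
            cluster_reps.append(tokens)
            labels.append(len(cluster_reps) - 1)
    return labels


def diversity_aware_rewards(
    responses: List[str],
    quality_fn: Callable[[str], float],
) -> List[float]:
    """Score each response by quality, then scale by a diversity bonus that is
    larger for responses falling in smaller (more distinctive) clusters."""
    labels = partition_by_overlap(responses)
    cluster_size = {cid: labels.count(cid) for cid in set(labels)}
    rewards = []
    for resp, cid in zip(responses, labels):
        quality = quality_fn(resp)
        diversity = 1.0 / cluster_size[cid]   # rarer cluster -> higher bonus
        rewards.append(quality * diversity)   # multiplicative combination (assumed)
    return rewards


if __name__ == "__main__":
    batch = [
        "The cat sat on the mat.",
        "The cat sat on the mat quietly.",
        "A spaceship drifted past Saturn's rings.",
    ]
    # Placeholder quality model: longer answers score slightly higher.
    print(diversity_aware_rewards(batch, quality_fn=lambda r: len(r.split()) / 10))
```

In this toy example the two near-duplicate responses split a single cluster's bonus, while the distinct third response keeps the full bonus, mirroring the abstract's goal of rewarding outputs that are both high-quality and distinct.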