Jointly Reinforcing Diversity and Quality in Language Model Generations

Abstract

Post-training of large language models (LMs) often prioritizes accuracy and helpfulness at the expense of diversity. This creates a tension: while post-training improves response quality, it also sharpens output distributions and narrows the range of ideas, limiting the usefulness of LMs in creative and exploratory tasks such as brainstorming, storytelling, or problem solving. We address this challenge with Diversity-Aware Reinforcement Learning (DARLING), a framework that jointly optimizes for response quality and semantic diversity. At its core, DARLING introduces a learned partition function to measure diversity beyond surface-level lexical variations. This diversity signal is then combined with a quality reward during online reinforcement learning, encouraging models to generate outputs that are both high-quality and distinct. Experiments across multiple model families and sizes show that DARLING generalizes to two regimes: non-verifiable tasks (instruction following and creative writing) and verifiable tasks (competition math). On five benchmarks in the first setting, DARLING consistently outperforms quality-only RL baselines, producing outputs that are both higher in quality and more novel. In the second setting, DARLING achieves higher pass@1 (solution quality) and pass@k (solution variety). Most strikingly, explicitly optimizing for diversity catalyzes exploration in online RL, which manifests as higher-quality responses.
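The abstract describes a partition of sampled responses into semantically equivalent groups and a diversity signal that is combined with a quality reward. The following is a minimal sketch of that idea, not the authors' implementation. Several pieces are assumptions made for illustration: the function names, the `same_words` stand-in (a purely lexical check used here in place of the learned partition function), the diversity score defined as the fraction of other responses in a different group, and the multiplicative combination with the quality reward.

```python
from typing import Callable, List, Sequence


def same_words(a: str, b: str) -> bool:
    """Toy stand-in for a learned equivalence classifier (assumption).

    A real semantic classifier would look past word choice; this one
    only compares case-insensitive word sets.
    """
    return set(a.lower().split()) == set(b.lower().split())


def partition(
    responses: Sequence[str],
    same_meaning: Callable[[str, str], bool],
) -> List[int]:
    """Assign a group id to each response.

    A response joins the first existing group whose representative it
    matches; otherwise it starts a new group.
    """
    group_ids: List[int] = []
    representatives: List[str] = []
    for response in responses:
        for gid, rep in enumerate(representatives):
            if same_meaning(response, rep):
                group_ids.append(gid)
                break
        else:
            representatives.append(response)
            group_ids.append(len(representatives) - 1)
    return group_ids


def diversity_scores(group_ids: Sequence[int]) -> List[float]:
    """Fraction of the other responses that fall in a different group."""
    n = len(group_ids)
    if n < 2:
        return [0.0] * n
    return [
        sum(1 for other in group_ids if other != g) / (n - 1)
        for g in group_ids
    ]


def combined_rewards(
    quality: Sequence[float],
    group_ids: Sequence[int],
) -> List[float]:
    """Scale each quality reward by its diversity score (assumed form)."""
    return [q * d for q, d in zip(quality, diversity_scores(group_ids))]


if __name__ == "__main__":
    responses = ["the cat sat", "The cat sat", "a dog ran", "birds fly south"]
    quality = [1.0, 0.5, 0.8, 0.2]  # hypothetical quality rewards
    groups = partition(responses, same_words)
    print(groups)
    print(combined_rewards(quality, groups))
```

Under this assumed multiplicative form, a response that duplicates every other sample receives zero reward regardless of its quality, while a high-quality response in its own group keeps its full quality reward. An additive or otherwise weighted combination would be an equally plausible reading of the abstract.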
