Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values
October 23, 2025
Authors: Dian Yu, Yulai Zhao, Kishan Panaganti, Linfeng Song, Haitao Mi, Dong Yu
cs.AI
Abstract
We propose Reinforcement Learning with Explicit Human Values (RLEV), a method
that aligns Large Language Model (LLM) optimization directly with quantifiable
human value signals. While Reinforcement Learning with Verifiable Rewards
(RLVR) effectively trains models in objective domains using binary correctness
rewards, it overlooks that not all tasks are equally significant. RLEV extends
this framework by incorporating human-defined value signals directly into the
reward function. Using exam-style data with explicit ground-truth value labels,
RLEV consistently outperforms correctness-only baselines across multiple RL
algorithms and model scales. Crucially, RLEV policies not only improve
value-weighted accuracy but also learn a value-sensitive termination policy:
concise for low-value prompts, thorough for high-value ones. We demonstrate
this behavior stems from value-weighted gradient amplification on
end-of-sequence tokens. Ablation studies confirm the gain is causally linked to
value alignment. RLEV remains robust under noisy value signals, such as
difficulty-based labels, demonstrating that optimizing for an explicit utility
function offers a practical path to aligning LLMs with human priorities.
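
As a minimal illustration of the reward modification described in the abstract, the sketch below scales a binary correctness reward by a per-prompt human value label. The function names, the exact-match correctness check, and the multiplicative combination are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a value-weighted reward in the spirit of RLEV.
# Assumptions (not from the paper's code): an exact-match correctness check
# and a multiplicative combination of correctness and human-assigned value.

def correctness_reward(response: str, ground_truth: str) -> float:
    """Binary correctness reward, as in RLVR-style training."""
    return 1.0 if response.strip() == ground_truth.strip() else 0.0

def rlev_reward(response: str, ground_truth: str, value: float) -> float:
    """Correctness reward scaled by the prompt's explicit human value label."""
    return value * correctness_reward(response, ground_truth)

# Example: a high-value exam question contributes a larger reward when solved.
print(rlev_reward("42", "42", value=5.0))  # 5.0
print(rlev_reward("42", "41", value=5.0))  # 0.0
print(rlev_reward("42", "42", value=0.5))  # 0.5
```

Under this kind of weighting, correct answers to high-value prompts contribute proportionally more to the policy gradient than correct answers to low-value prompts, which is consistent with the value-weighted gradient amplification the abstract describes.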