

Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values

October 23, 2025
Authors: Dian Yu, Yulai Zhao, Kishan Panaganti, Linfeng Song, Haitao Mi, Dong Yu
cs.AI

Abstract

We propose Reinforcement Learning with Explicit Human Values (RLEV), a method that aligns Large Language Model (LLM) optimization directly with quantifiable human value signals. While Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains models in objective domains using binary correctness rewards, it overlooks that not all tasks are equally significant. RLEV extends this framework by incorporating human-defined value signals directly into the reward function. Using exam-style data with explicit ground-truth value labels, RLEV consistently outperforms correctness-only baselines across multiple RL algorithms and model scales. Crucially, RLEV policies not only improve value-weighted accuracy but also learn a value-sensitive termination policy: concise for low-value prompts, thorough for high-value ones. We demonstrate this behavior stems from value-weighted gradient amplification on end-of-sequence tokens. Ablation studies confirm the gain is causally linked to value alignment. RLEV remains robust under noisy value signals, such as difficulty-based labels, demonstrating that optimizing for an explicit utility function offers a practical path to aligning LLMs with human priorities.
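Below is a minimal sketch of the value-weighted reward idea described in the abstract: a binary correctness reward scaled by a human-defined value label attached to each prompt. The function and variable names are illustrative assumptions, not taken from the paper's implementation.

```python
# Minimal sketch of a value-weighted reward (RLEV-style), under the
# assumption that rewards are per-sequence scalars, the correctness check
# is binary, and each prompt carries a non-negative human value label
# (e.g., exam points). Names are hypothetical, not from the paper.

def value_weighted_reward(is_correct: bool, value: float) -> float:
    """Scale a binary correctness reward by the prompt's value label."""
    correctness = 1.0 if is_correct else 0.0
    return correctness * value

# A correct answer to a high-value question earns a larger reward than a
# correct answer to a low-value one; incorrect answers earn nothing.
print(value_weighted_reward(True, 5.0))   # 5.0
print(value_weighted_reward(True, 1.0))   # 1.0
print(value_weighted_reward(False, 5.0))  # 0.0
```

Because the reward scales with the value label, policy-gradient updates for high-value prompts are amplified relative to low-value ones, which is consistent with the abstract's observation of value-weighted gradient amplification and value-sensitive termination behavior.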