Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values
October 23, 2025
Authors: Dian Yu, Yulai Zhao, Kishan Panaganti, Linfeng Song, Haitao Mi, Dong Yu
cs.AI
Abstract
We propose Reinforcement Learning with Explicit Human Values (RLEV), a method
that aligns Large Language Model (LLM) optimization directly with quantifiable
human value signals. While Reinforcement Learning with Verifiable Rewards
(RLVR) effectively trains models in objective domains using binary correctness
rewards, it overlooks that not all tasks are equally significant. RLEV extends
this framework by incorporating human-defined value signals directly into the
reward function. Using exam-style data with explicit ground-truth value labels,
RLEV consistently outperforms correctness-only baselines across multiple RL
algorithms and model scales. Crucially, RLEV policies not only improve
value-weighted accuracy but also learn a value-sensitive termination policy:
concise for low-value prompts, thorough for high-value ones. We demonstrate
this behavior stems from value-weighted gradient amplification on
end-of-sequence tokens. Ablation studies confirm the gain is causally linked to
value alignment. RLEV remains robust under noisy value signals, such as
difficulty-based labels, demonstrating that optimizing for an explicit utility
function offers a practical path to aligning LLMs with human priorities.
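
To make the difference between plain accuracy and value-weighted accuracy concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact reward formula, so `value_weighted_reward` simply pays out the prompt's value on a correct answer, and all function names and the example point values are hypothetical.

```python
from typing import Sequence


def binary_reward(correct: bool) -> float:
    """Correctness-only (RLVR-style) reward: 1 if correct, else 0."""
    return 1.0 if correct else 0.0


def value_weighted_reward(correct: bool, value: float) -> float:
    """Hypothetical value-aware reward (an assumption, not the paper's formula):
    a correct answer earns the prompt's value, an incorrect one earns 0."""
    return value if correct else 0.0


def accuracy(correct_flags: Sequence[bool]) -> float:
    """Fraction of prompts answered correctly, ignoring their values."""
    if not correct_flags:
        raise ValueError("need at least one prompt")
    return sum(correct_flags) / len(correct_flags)


def value_weighted_accuracy(
    correct_flags: Sequence[bool], values: Sequence[float]
) -> float:
    """Share of the total available value earned by correct answers."""
    if len(correct_flags) != len(values):
        raise ValueError("correct_flags and values must have the same length")
    total = sum(values)
    if total <= 0:
        raise ValueError("total value must be positive")
    earned = sum(
        value_weighted_reward(c, v) for c, v in zip(correct_flags, values)
    )
    return earned / total


# Hypothetical exam: three questions worth 1, 2, and 10 points.
values = [1.0, 2.0, 10.0]
# The model gets the two low-value questions right and misses the 10-point one.
correct_flags = [True, True, False]

plain = accuracy(correct_flags)                             # 2 of 3 correct
weighted = value_weighted_accuracy(correct_flags, values)   # 3 of 13 points
```

In this made-up exam the two metrics diverge sharply: two of three answers are correct, yet only 3 of 13 points are earned, which is the kind of gap a correctness-only reward cannot see.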