Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values

Abstract

We propose Reinforcement Learning with Explicit Human Values (RLEV), a method that aligns Large Language Model (LLM) optimization directly with quantifiable human value signals. While Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains models in objective domains using binary correctness rewards, it overlooks the fact that not all tasks are equally significant. RLEV extends this framework by incorporating human-defined value signals directly into the reward function. Using exam-style data with explicit ground-truth value labels, RLEV consistently outperforms correctness-only baselines across multiple RL algorithms and model scales. Crucially, RLEV policies not only improve value-weighted accuracy but also learn a value-sensitive termination policy: concise responses for low-value prompts and thorough responses for high-value ones. We show that this behavior stems from value-weighted gradient amplification on end-of-sequence tokens. Ablation studies confirm that the gain is causally linked to value alignment. RLEV remains robust under noisy value signals, such as difficulty-based labels, which suggests that optimizing an explicit utility function offers a practical path to aligning LLMs with human priorities.
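
To make the contrast between a binary correctness reward and a value-aware reward concrete, the following is a minimal sketch. The abstract does not give the exact reward formula, so it assumes the simplest possible form: the reward for a correct answer equals the prompt's human-assigned value, and an incorrect answer earns zero. The function names (`binary_reward`, `value_weighted_reward`, `value_weighted_accuracy`) and the example point values are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def binary_reward(is_correct: bool) -> float:
    """Correctness-only reward: every prompt counts the same."""
    return 1.0 if is_correct else 0.0


def value_weighted_reward(is_correct: bool, value: float) -> float:
    """Assumed value-aware reward: a correct answer earns the prompt's
    human-assigned value, an incorrect answer earns nothing."""
    return value if is_correct else 0.0


def value_weighted_accuracy(correct: Sequence[bool], values: Sequence[float]) -> float:
    """Share of the total available value that was earned by correct answers."""
    if len(correct) != len(values):
        raise ValueError("correct and values must have the same length")
    total = sum(values)
    if total <= 0:
        raise ValueError("total value must be positive")
    earned = sum(value_weighted_reward(c, v) for c, v in zip(correct, values))
    return earned / total


# Hypothetical exam: two 2-point questions answered correctly,
# one 6-point question answered incorrectly.
correct = [True, True, False]
values = [2.0, 2.0, 6.0]

plain_accuracy = sum(binary_reward(c) for c in correct) / len(correct)
weighted_accuracy = value_weighted_accuracy(correct, values)
print(f"plain accuracy: {plain_accuracy:.3f}, value-weighted accuracy: {weighted_accuracy:.3f}")
```

In this toy case the two metrics disagree: two of three answers are correct, yet only 4 of the 10 available points are earned, because the missed question carries most of the value. Under the assumed reward, the training signal for a sampled response is likewise scaled by the prompt's value, which is consistent with the abstract's description of value-weighted gradient amplification, although the precise mechanism is not specified here.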