RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents
July 3, 2025
Authors: Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He, Kang Luo, Qingsong Lv, Qingxuan Jiang, Zheng Xie, Shanyi Wang, Yuan Li, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li
cs.AI
Abstract
Large language models (LLMs) excel at logical and algorithmic reasoning, yet
their emotional intelligence (EQ) still lags far behind their cognitive
prowess. While reinforcement learning from verifiable rewards (RLVR) has
advanced in other domains, its application to dialogue, especially for emotional
intelligence, remains underexplored. In this work, we introduce RLVER, the first
end-to-end reinforcement learning framework that leverages verifiable emotion
rewards from simulated users to cultivate higher-order empathetic abilities in
LLMs. Within this framework, self-consistent affective simulated users engage
in dialogue rollouts and produce deterministic emotion scores during
conversations, serving as reward signals to guide the LLM's learning.
Fine-tuning the publicly available Qwen2.5-7B-Instruct model with PPO boosts its
Sentient-Benchmark score from 13.3 to 79.2 while largely preserving
mathematical and coding competence. Extensive experiments reveal that: (i)
RLVER consistently improves multiple dialogue capabilities; (ii) thinking and
non-thinking models show distinct trends: thinking models excel in empathy and
insight, while non-thinking models favor action; (iii) GRPO often yields stable
gains, while PPO can push certain capabilities to a higher ceiling; (iv) more
challenging environments are not always better; moderate ones can yield stronger
outcomes. Our results show that RLVER is a practical route toward emotionally
intelligent and broadly capable language agents.
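The core idea of the reward design can be illustrated with a minimal sketch: a simulated user updates an emotion state deterministically in response to each agent utterance, and the final emotion score serves as the verifiable scalar reward for the RL update (e.g. PPO). Everything below (the cue lists, class names, and scoring rule) is a hypothetical toy construction for illustration, not the paper's actual simulated-user implementation.

```python
# Toy cue lists standing in for a real affective user model (assumed, illustrative).
EMPATHY_CUES = {"understand", "sorry", "feel", "hear"}
DISMISSIVE_CUES = {"whatever", "irrelevant", "calm down"}

class SimulatedUser:
    """Deterministic affective user: the same dialogue always yields the same score,
    which is what makes the emotion reward verifiable."""

    def __init__(self, initial_emotion: int = 50):
        # Emotion state on a 0 (distressed) .. 100 (comforted) scale.
        self.emotion = initial_emotion

    def react(self, agent_utterance: str) -> int:
        text = agent_utterance.lower()
        # Reward empathetic phrasing, penalize dismissive phrasing.
        self.emotion += 10 * sum(cue in text for cue in EMPATHY_CUES)
        self.emotion -= 15 * sum(cue in text for cue in DISMISSIVE_CUES)
        self.emotion = max(0, min(100, self.emotion))
        return self.emotion

def rollout_reward(agent_turns: list[str]) -> float:
    """Run one dialogue rollout against the simulated user and return the
    normalized emotion score, to be used as the scalar RL reward."""
    user = SimulatedUser()
    score = user.emotion
    for turn in agent_turns:
        score = user.react(turn)
    return score / 100.0
```

Because the user's reaction is a pure function of the dialogue so far, the reward is reproducible across rollouts of the same trajectory, unlike rewards sampled from a stochastic judge model.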