RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents
July 3, 2025
Authors: Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He, Kang Luo, Qingsong Lv, Qingxuan Jiang, Zheng Xie, Shanyi Wang, Yuan Li, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li
cs.AI
Abstract
Large language models (LLMs) excel at logical and algorithmic reasoning, yet
their emotional intelligence (EQ) still lags far behind their cognitive
prowess. While reinforcement learning from verifiable rewards (RLVR) has
advanced in other domains, its application to dialogue, especially for emotional
intelligence, remains underexplored. In this work, we introduce RLVER, the first
end-to-end reinforcement learning framework that leverages verifiable emotion
rewards from simulated users to cultivate higher-order empathetic abilities in
LLMs. Within this framework, self-consistent affective simulated users engage
in dialogue rollouts and produce deterministic emotion scores during
conversations, serving as reward signals to guide the LLM's learning.
Fine-tuning the publicly available Qwen2.5-7B-Instruct model with PPO boosts its
Sentient-Benchmark score from 13.3 to 79.2 while largely preserving
mathematical and coding competence. Extensive experiments reveal that: (i)
RLVER consistently improves multiple dialogue capabilities; (ii) thinking and
non-thinking models show distinct trends: thinking models excel in empathy and
insight, while non-thinking models favor action; (iii) GRPO often yields stable
gains, while PPO can push certain capabilities to a higher ceiling; (iv) more
challenging environments are not always better; moderate ones can yield stronger
outcomes. Our results show that RLVER is a practical route toward emotionally
intelligent and broadly capable language agents.
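The core idea of the reward design can be illustrated with a minimal sketch: a simulated user updates an emotion state deterministically in response to each agent utterance, and the final emotion score serves as the verifiable scalar reward for the RL update (e.g. PPO). Everything below (the cue lists, class names, and scoring rule) is a hypothetical toy construction for illustration, not the paper's actual simulated-user implementation.

```python
# Toy cue lists standing in for a real affective user model (assumed, illustrative).
EMPATHY_CUES = {"understand", "sorry", "feel", "hear"}
DISMISSIVE_CUES = {"whatever", "irrelevant", "calm down"}

class SimulatedUser:
    """Deterministic affective user: the same dialogue always yields the same score,
    which is what makes the emotion reward verifiable."""

    def __init__(self, initial_emotion: int = 50):
        # Emotion state on a 0 (distressed) .. 100 (comforted) scale.
        self.emotion = initial_emotion

    def react(self, agent_utterance: str) -> int:
        text = agent_utterance.lower()
        # Reward empathetic phrasing, penalize dismissive phrasing.
        self.emotion += 10 * sum(cue in text for cue in EMPATHY_CUES)
        self.emotion -= 15 * sum(cue in text for cue in DISMISSIVE_CUES)
        self.emotion = max(0, min(100, self.emotion))
        return self.emotion

def rollout_reward(agent_turns: list[str]) -> float:
    """Run one dialogue rollout against the simulated user and return the
    normalized emotion score, to be used as the scalar RL reward."""
    user = SimulatedUser()
    score = user.emotion
    for turn in agent_turns:
        score = user.react(turn)
    return score / 100.0
```

Because the user's reaction is a pure function of the dialogue so far, the reward is reproducible across rollouts of the same trajectory, unlike rewards sampled from a stochastic judge model.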