RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents
July 3, 2025
Authors: Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He, Kang Luo, Qingsong Lv, Qingxuan Jiang, Zheng Xie, Shanyi Wang, Yuan Li, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li
cs.AI
Abstract
Large language models (LLMs) excel at logical and algorithmic reasoning, yet
their emotional intelligence (EQ) still lags far behind their cognitive
prowess. While reinforcement learning from verifiable rewards (RLVR) has
advanced in other domains, its application to dialogue, especially for
emotional intelligence, remains underexplored. In this work, we introduce
RLVER, the first end-to-end reinforcement learning framework that leverages
verifiable emotion rewards from simulated users to cultivate higher-order
empathetic abilities in LLMs. Within this framework, self-consistent affective
simulated users engage in dialogue rollouts and produce deterministic emotion
scores during conversations, which serve as reward signals to guide the LLM's
learning. Fine-tuning the publicly available Qwen2.5-7B-Instruct model with PPO
boosts its Sentient-Benchmark score from 13.3 to 79.2 while largely preserving
mathematical and coding competence. Extensive experiments reveal that: (i)
RLVER consistently improves multiple dialogue capabilities; (ii) thinking and
non-thinking models show distinct trends: thinking models excel in empathy and
insight, while non-thinking models favor action; (iii) GRPO often yields stable
gains, while PPO can push certain capabilities to a higher ceiling; (iv) more
challenging environments are not always better; moderate ones can yield
stronger outcomes. Our results show that RLVER is a practical route toward
emotionally intelligent and broadly capable language agents.
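
To make the training loop the abstract describes more concrete, here is a minimal Python sketch: a simulated user tracks a self-consistent emotional state, updates it deterministically after each agent turn, and its final emotion score serves as the verifiable scalar reward for a policy update (e.g., with PPO). All names (`SimulatedUser`, `rollout_reward`) and the toy keyword heuristic are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch of a verifiable-emotion-reward rollout (assumed API).

class SimulatedUser:
    """Affective simulated user whose emotion score is deterministic."""

    def __init__(self, persona: str):
        self.persona = persona
        self.emotion = 50.0  # emotion score on an assumed 0-100 scale

    def react(self, agent_utterance: str) -> str:
        # A real system would query an LLM conditioned on the persona and
        # dialogue history; a toy keyword heuristic stands in for it here.
        empathetic = any(
            w in agent_utterance.lower() for w in ("understand", "sorry", "feel")
        )
        delta = 10.0 if empathetic else -5.0
        self.emotion = max(0.0, min(100.0, self.emotion + delta))
        return f"(user reply; emotion is now {self.emotion:.0f})"


def rollout_reward(policy, persona: str, turns: int = 5) -> float:
    """Run one dialogue rollout and return the final emotion score,
    normalized to [0, 1], as the reward signal for the RL update."""
    user = SimulatedUser(persona)
    user_msg = "I had a terrible day at work."
    for _ in range(turns):
        agent_msg = policy(user_msg)  # policy = the LLM being trained
        user_msg = user.react(agent_msg)
    return user.emotion / 100.0


if __name__ == "__main__":
    # Stand-in "policy" that always acknowledges the user's feelings.
    echo_policy = lambda msg: "I understand how you feel. " + msg
    print(rollout_reward(echo_policy, persona="stressed office worker"))
```

Because the simulated user's scoring is deterministic, the same policy and dialogue always yield the same reward, which is what makes the signal verifiable rather than a noisy judge rating.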