RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents
July 3, 2025
Authors: Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He, Kang Luo, Qingsong Lv, Qingxuan Jiang, Zheng Xie, Shanyi Wang, Yuan Li, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li
cs.AI
Abstract
Large language models (LLMs) excel at logical and algorithmic reasoning, yet
their emotional intelligence (EQ) still lags far behind their cognitive
prowess. While reinforcement learning from verifiable rewards (RLVR) has
advanced in other domains, its application to dialogue, especially for
emotional intelligence, remains underexplored. In this work, we introduce
RLVER, the first end-to-end reinforcement learning framework that leverages
verifiable emotion rewards from simulated users to cultivate higher-order
empathetic abilities in LLMs. Within this framework, self-consistent affective
simulated users engage in dialogue rollouts and produce deterministic emotion
scores during conversations, which serve as reward signals to guide the LLM's
learning. Fine-tuning the publicly available Qwen2.5-7B-Instruct model with PPO
boosts its Sentient-Benchmark score from 13.3 to 79.2 while largely preserving
mathematical and coding competence. Extensive experiments reveal that: (i)
RLVER consistently improves multiple dialogue capabilities; (ii) thinking and
non-thinking models show distinct trends: thinking models excel in empathy and
insight, while non-thinking models favor action; (iii) GRPO often yields stable
gains, while PPO can push certain capabilities to a higher ceiling; (iv) more
challenging environments are not always better; moderate ones can yield
stronger outcomes. Our results show that RLVER is a practical route toward
emotionally intelligent and broadly capable language agents.
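
To make the training loop the abstract describes more concrete, here is a minimal Python sketch: a simulated user tracks a self-consistent emotional state, updates it deterministically after each agent turn, and its final emotion score serves as the verifiable scalar reward for a policy update (e.g., with PPO). All names (`SimulatedUser`, `rollout_reward`) and the toy keyword heuristic are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch of a verifiable-emotion-reward rollout (assumed API).

class SimulatedUser:
    """Affective simulated user whose emotion score is deterministic."""

    def __init__(self, persona: str):
        self.persona = persona
        self.emotion = 50.0  # emotion score on an assumed 0-100 scale

    def react(self, agent_utterance: str) -> str:
        # A real system would query an LLM conditioned on the persona and
        # dialogue history; a toy keyword heuristic stands in for it here.
        empathetic = any(
            w in agent_utterance.lower() for w in ("understand", "sorry", "feel")
        )
        delta = 10.0 if empathetic else -5.0
        self.emotion = max(0.0, min(100.0, self.emotion + delta))
        return f"(user reply; emotion is now {self.emotion:.0f})"


def rollout_reward(policy, persona: str, turns: int = 5) -> float:
    """Run one dialogue rollout and return the final emotion score,
    normalized to [0, 1], as the reward signal for the RL update."""
    user = SimulatedUser(persona)
    user_msg = "I had a terrible day at work."
    for _ in range(turns):
        agent_msg = policy(user_msg)  # policy = the LLM being trained
        user_msg = user.react(agent_msg)
    return user.emotion / 100.0


if __name__ == "__main__":
    # Stand-in "policy" that always acknowledges the user's feelings.
    echo_policy = lambda msg: "I understand how you feel. " + msg
    print(rollout_reward(echo_policy, persona="stressed office worker"))
```

Because the simulated user's scoring is deterministic, the same policy and dialogue always yield the same reward, which is what makes the signal verifiable rather than a noisy judge rating.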