RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents

Abstract

Large language models (LLMs) excel at logical and algorithmic reasoning, yet their emotional intelligence (EQ) still lags far behind their cognitive prowess. While reinforcement learning from verifiable rewards (RLVR) has advanced in other domains, its application to dialogue, especially for emotional intelligence, remains underexplored. In this work, we introduce RLVER, the first end-to-end reinforcement learning framework that leverages verifiable emotion rewards from simulated users to cultivate higher-order empathetic abilities in LLMs. Within this framework, self-consistent affective simulated users engage in dialogue rollouts and produce deterministic emotion scores during conversations, which serve as reward signals to guide the LLM's learning. Fine-tuning the publicly available Qwen2.5-7B-Instruct model with PPO boosts its Sentient-Benchmark score from 13.3 to 79.2 while largely preserving mathematical and coding competence. Extensive experiments reveal that: (i) RLVER consistently improves multiple dialogue capabilities; (ii) thinking and non-thinking models show distinct trends: thinking models excel in empathy and insight, while non-thinking models favor action; (iii) GRPO often yields stable gains, while PPO can push certain capabilities to a higher ceiling; (iv) more challenging environments are not always better, as moderate ones can yield stronger outcomes. Our results show that RLVER is a practical route toward emotionally intelligent and broadly capable language agents.
