Sotopia-RL: Reward Design for Social Intelligence

Abstract

Social intelligence has become a critical capability for large language models (LLMs), enabling them to engage effectively in real-world social tasks such as accommodation, persuasion, collaboration, and negotiation. Reinforcement learning (RL) is a natural fit for training socially intelligent agents because it allows models to learn sophisticated strategies directly through social interactions. However, social interactions have two key characteristics that pose barriers to RL training: (1) partial observability, where utterances have indirect and delayed effects that complicate credit assignment, and (2) multi-dimensionality, where behaviors such as rapport-building or knowledge-seeking contribute indirectly to goal achievement. These characteristics make Markov decision process (MDP)-based RL with single-dimensional episode-level rewards inefficient and unstable. To address these challenges, we propose Sotopia-RL, a novel framework that refines coarse episode-level feedback into utterance-level, multi-dimensional rewards. Utterance-level credit assignment mitigates partial observability by attributing outcomes to individual utterances, while multi-dimensional rewards capture the full richness of social interactions and reduce reward hacking. Experiments in Sotopia, an open-ended social learning environment, demonstrate that Sotopia-RL achieves state-of-the-art social goal completion scores (7.17 on Sotopia-hard and 8.31 on Sotopia-full), significantly outperforming existing approaches. Ablation studies confirm that both utterance-level credit assignment and multi-dimensional reward design are necessary for RL training. Our implementation is publicly available at https://github.com/sotopia-lab/sotopia-rl.
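To make the idea of refining episode-level feedback into utterance-level, multi-dimensional rewards concrete, the following is a minimal illustrative sketch. It is not the implementation from the repository: the function `utterance_rewards`, the dimension names `goal` and `rapport`, the attribution scores, and the proportional split are all assumptions chosen for illustration. It assumes that some annotator has already assigned each utterance a non-negative attribution score per dimension.

```python
from typing import Dict, List


def utterance_rewards(
    episode_scores: Dict[str, float],
    attributions: Dict[str, List[float]],
    dim_weights: Dict[str, float],
) -> List[float]:
    """Spread each dimension's episode-level score over utterances,
    then mix the dimensions into one scalar reward per utterance.

    Hypothetical helper: the proportional split and the weighted mix
    are illustrative assumptions, not the paper's stated method.
    """
    n = len(next(iter(attributions.values())))
    total_weight = sum(dim_weights.values())
    rewards = [0.0] * n
    for dim, score in episode_scores.items():
        raw = attributions[dim]
        norm = sum(raw)
        # Fall back to a uniform split if no utterance received credit.
        shares = [r / norm for r in raw] if norm > 0 else [1.0 / n] * n
        for i, share in enumerate(shares):
            rewards[i] += dim_weights[dim] / total_weight * score * share
    return rewards


# Toy episode with three utterances and two (assumed) reward dimensions.
episode_scores = {"goal": 8.0, "rapport": 4.0}
attributions = {"goal": [0.0, 1.0, 3.0], "rapport": [2.0, 1.0, 1.0]}
dim_weights = {"goal": 0.5, "rapport": 0.5}

rewards = utterance_rewards(episode_scores, attributions, dim_weights)
print(rewards)  # [1.0, 1.5, 3.5]
```

In this toy case the first utterance earns reward only through the rapport dimension, which a single goal-only signal would miss. The per-utterance rewards sum to the weighted episode score (6.0), so the episode-level feedback is redistributed rather than inflated.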
