Sotopia-RL: Reward Design for Social Intelligence
August 5, 2025
Authors: Haofei Yu, Zhengyang Qi, Yining Zhao, Kolby Nottingham, Keyang Xuan, Bodhisattwa Prasad Majumder, Hao Zhu, Paul Pu Liang, Jiaxuan You
cs.AI
Abstract
Social intelligence has become a critical capability for large language
models (LLMs), enabling them to engage effectively in real-world social tasks
such as accommodation, persuasion, collaboration, and negotiation.
Reinforcement learning (RL) is a natural fit for training socially intelligent
agents because it allows models to learn sophisticated strategies directly
through social interactions. However, social interactions have two key
characteristics that set barriers for RL training: (1) partial observability,
where utterances have indirect and delayed effects that complicate credit
assignment, and (2) multi-dimensionality, where behaviors such as
rapport-building or knowledge-seeking contribute indirectly to goal
achievement. These characteristics make Markov decision process (MDP)-based RL
with single-dimensional episode-level rewards inefficient and unstable. To
address these challenges, we propose Sotopia-RL, a novel framework that refines
coarse episode-level feedback into utterance-level, multi-dimensional rewards.
Utterance-level credit assignment mitigates partial observability by
attributing outcomes to individual utterances, while multi-dimensional rewards
capture the full richness of social interactions and reduce reward hacking.
Experiments in Sotopia, an open-ended social learning environment, demonstrate
that Sotopia-RL achieves state-of-the-art social goal completion scores (7.17
on Sotopia-hard and 8.31 on Sotopia-full), significantly outperforming existing
approaches. Ablation studies confirm the necessity of both utterance-level
credit assignment and multi-dimensional reward design for RL training. Our
implementation is publicly available at:
https://github.com/sotopia-lab/sotopia-rl.
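The core idea of the abstract, refining a single coarse episode-level score into utterance-level, multi-dimensional rewards that are then collapsed into one scalar per utterance, can be sketched as follows. This is a minimal illustration, not the paper's actual implementation; the function name, the reward dimensions, and the weighted-sum combination are all assumptions for the example.

```python
# Hypothetical sketch of utterance-level, multi-dimensional reward refinement.
# Assumes each utterance has already been scored per dimension (e.g. by an
# LLM annotator); the dimension names and weights below are illustrative only.

def refine_rewards(utterance_scores, dim_weights):
    """Combine per-dimension scores into one scalar reward per utterance.

    utterance_scores: list of dicts mapping dimension name -> score
    dim_weights: dict mapping dimension name -> weight
    """
    rewards = []
    for scores in utterance_scores:
        # Weighted sum over reward dimensions; unknown dimensions get weight 0.
        reward = sum(dim_weights.get(dim, 0.0) * s for dim, s in scores.items())
        rewards.append(reward)
    return rewards

# Example: two utterances scored on direct goal progress and rapport-building.
scores = [
    {"goal_completion": 0.2, "relationship": 0.8},  # rapport-building turn
    {"goal_completion": 0.9, "relationship": 0.1},  # goal-advancing turn
]
weights = {"goal_completion": 1.0, "relationship": 0.5}
print(refine_rewards(scores, weights))
```

Assigning each utterance its own multi-dimensional reward, rather than sharing one episode-level score across all turns, is what the abstract argues eases credit assignment under partial observability and discourages reward hacking on any single dimension.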