Sotopia-RL: Reward Design for Social Intelligence
August 5, 2025
Authors: Haofei Yu, Zhengyang Qi, Yining Zhao, Kolby Nottingham, Keyang Xuan, Bodhisattwa Prasad Majumder, Hao Zhu, Paul Pu Liang, Jiaxuan You
cs.AI
Abstract
Social intelligence has become a critical capability for large language
models (LLMs), enabling them to engage effectively in real-world social tasks
such as accommodation, persuasion, collaboration, and negotiation.
Reinforcement learning (RL) is a natural fit for training socially intelligent
agents because it allows models to learn sophisticated strategies directly
through social interactions. However, social interactions have two key
characteristics that set barriers for RL training: (1) partial observability,
where utterances have indirect and delayed effects that complicate credit
assignment, and (2) multi-dimensionality, where behaviors such as
rapport-building or knowledge-seeking contribute indirectly to goal
achievement. These characteristics make Markov decision process (MDP)-based RL
with single-dimensional episode-level rewards inefficient and unstable. To
address these challenges, we propose Sotopia-RL, a novel framework that refines
coarse episode-level feedback into utterance-level, multi-dimensional rewards.
Utterance-level credit assignment mitigates partial observability by
attributing outcomes to individual utterances, while multi-dimensional rewards
capture the full richness of social interactions and reduce reward hacking.
Experiments in Sotopia, an open-ended social learning environment, demonstrate
that Sotopia-RL achieves state-of-the-art social goal completion scores (7.17
on Sotopia-hard and 8.31 on Sotopia-full), significantly outperforming existing
approaches. Ablation studies confirm the necessity of both utterance-level
credit assignment and multi-dimensional reward design for RL training. Our
implementation is publicly available at:
https://github.com/sotopia-lab/sotopia-rl.
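The core idea of the abstract, refining a single coarse episode-level score into utterance-level, multi-dimensional rewards that are then collapsed into one scalar per utterance, can be sketched as follows. This is a minimal illustration, not the paper's actual implementation; the function name, the reward dimensions, and the weighted-sum combination are all assumptions for the example.

```python
# Hypothetical sketch of utterance-level, multi-dimensional reward refinement.
# Assumes each utterance has already been scored per dimension (e.g. by an
# LLM annotator); the dimension names and weights below are illustrative only.

def refine_rewards(utterance_scores, dim_weights):
    """Combine per-dimension scores into one scalar reward per utterance.

    utterance_scores: list of dicts mapping dimension name -> score
    dim_weights: dict mapping dimension name -> weight
    """
    rewards = []
    for scores in utterance_scores:
        # Weighted sum over reward dimensions; unknown dimensions get weight 0.
        reward = sum(dim_weights.get(dim, 0.0) * s for dim, s in scores.items())
        rewards.append(reward)
    return rewards

# Example: two utterances scored on direct goal progress and rapport-building.
scores = [
    {"goal_completion": 0.2, "relationship": 0.8},  # rapport-building turn
    {"goal_completion": 0.9, "relationship": 0.1},  # goal-advancing turn
]
weights = {"goal_completion": 1.0, "relationship": 0.5}
print(refine_rewards(scores, weights))
```

Assigning each utterance its own multi-dimensional reward, rather than sharing one episode-level score across all turns, is what the abstract argues eases credit assignment under partial observability and discourages reward hacking on any single dimension.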