Sotopia-RL: Reward Design for Social Intelligence

Abstract

Social intelligence has become a critical capability for large language models (LLMs), enabling them to engage effectively in real-world social tasks such as accommodation, persuasion, collaboration, and negotiation. Reinforcement learning (RL) is a natural fit for training socially intelligent agents because it allows models to learn sophisticated strategies directly through social interactions. However, social interactions have two key characteristics that pose obstacles to RL training: (1) partial observability, where utterances have indirect and delayed effects that complicate credit assignment, and (2) multi-dimensionality, where behaviors such as rapport-building or knowledge-seeking contribute only indirectly to goal achievement. These characteristics make Markov decision process (MDP)-based RL with single-dimensional, episode-level rewards inefficient and unstable. To address these challenges, we propose Sotopia-RL, a novel framework that refines coarse episode-level feedback into utterance-level, multi-dimensional rewards. Utterance-level credit assignment mitigates partial observability by attributing outcomes to individual utterances, while multi-dimensional rewards capture the full richness of social interactions and reduce reward hacking. Experiments in Sotopia, an open-ended social learning environment, demonstrate that Sotopia-RL achieves state-of-the-art social goal completion scores (7.17 on Sotopia-hard and 8.31 on Sotopia-full), significantly outperforming existing approaches. Ablation studies confirm that both utterance-level credit assignment and multi-dimensional reward design are necessary for RL training. Our implementation is publicly available at: https://github.com/sotopia-lab/sotopia-rl.
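To make the idea of refining episode-level feedback into utterance-level, multi-dimensional rewards more concrete, here is a minimal sketch in Python. It is not the implementation from the repository above. The function name `utterance_rewards`, the dimension names (`goal`, `rapport`), the per-utterance attribution scores, and the mixing weights are all hypothetical, chosen only for illustration. The sketch assumes that some external judge has already assigned each utterance a non-negative attribution score per dimension.

```python
from typing import Dict, List


def utterance_rewards(
    episode_scores: Dict[str, float],
    attributions: Dict[str, List[float]],
    dim_weights: Dict[str, float],
) -> List[float]:
    """Spread each dimension's episode-level score over utterances, then mix dimensions.

    All names and the normalization scheme are illustrative assumptions,
    not the method described in the paper.
    """
    n = len(next(iter(attributions.values())))
    rewards = [0.0] * n
    for dim, score in episode_scores.items():
        raw = attributions[dim]
        if len(raw) != n:
            raise ValueError(f"dimension {dim!r} has {len(raw)} attributions, expected {n}")
        total = sum(raw)
        # Normalize attributions into shares; fall back to a uniform split
        # when no utterance received credit on this dimension.
        shares = [a / total for a in raw] if total > 0 else [1.0 / n] * n
        for i, share in enumerate(shares):
            rewards[i] += dim_weights[dim] * score * share
    return rewards


# Hypothetical three-utterance episode scored on two dimensions.
example = utterance_rewards(
    episode_scores={"goal": 8.0, "rapport": 4.0},
    attributions={"goal": [1.0, 1.0, 2.0], "rapport": [2.0, 0.0, 0.0]},
    dim_weights={"goal": 0.5, "rapport": 0.5},
)
print(example)  # [3.0, 1.0, 2.0]
```

In this toy setup the per-utterance rewards sum to the weighted episode-level score (0.5 × 8.0 + 0.5 × 4.0 = 6.0), so the episode-level signal is redistributed rather than changed. The first utterance earns most of its reward from the rapport dimension, which a single goal-only episode score would not credit to it directly.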