Sotopia-RL: Reward Design for Social Intelligence

Abstract

Social intelligence has become a critical capability for large language models (LLMs), enabling them to engage effectively in real-world social tasks such as accommodation, persuasion, collaboration, and negotiation. Reinforcement learning (RL) is a natural fit for training socially intelligent agents because it allows models to learn sophisticated strategies directly through social interactions. However, social interactions have two key characteristics that create barriers to RL training: (1) partial observability, where utterances have indirect and delayed effects that complicate credit assignment, and (2) multi-dimensionality, where behaviors such as rapport-building or knowledge-seeking contribute indirectly to goal achievement. These characteristics make Markov decision process (MDP)-based RL with single-dimensional episode-level rewards inefficient and unstable. To address these challenges, we propose Sotopia-RL, a novel framework that refines coarse episode-level feedback into utterance-level, multi-dimensional rewards. Utterance-level credit assignment mitigates partial observability by attributing outcomes to individual utterances, while multi-dimensional rewards capture the full richness of social interactions and reduce reward hacking. Experiments in Sotopia, an open-ended social learning environment, demonstrate that Sotopia-RL achieves state-of-the-art social goal completion scores (7.17 on Sotopia-hard and 8.31 on Sotopia-full), significantly outperforming existing approaches. Ablation studies confirm that both utterance-level credit assignment and multi-dimensional reward design are necessary for RL training. Our implementation is publicly available at https://github.com/sotopia-lab/sotopia-rl.
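
As an illustration only, the idea of refining episode-level feedback into utterance-level, multi-dimensional rewards can be sketched as below. The abstract does not specify how outcomes are attributed to utterances or how dimensions are combined, so everything here is an assumption: the `Utterance` class, the `utterance_rewards` function, the per-utterance attribution weights, the dimension names, and the mixing weights are hypothetical and are not taken from the Sotopia-RL implementation.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    text: str
    # Hypothetical per-dimension attribution weights (non-negative).
    # How such weights would be obtained is NOT specified in the abstract.
    attribution: dict[str, float]


def utterance_rewards(utterances, episode_scores, dim_weights):
    """Spread each episode-level score over utterances in proportion to
    their attribution weight, then mix dimensions with dim_weights.

    This is an assumed, simplified scheme for illustration only.
    """
    rewards = [0.0] * len(utterances)
    for dim, score in episode_scores.items():
        total = sum(u.attribution.get(dim, 0.0) for u in utterances)
        if total == 0:
            continue  # no utterance is credited on this dimension
        for i, u in enumerate(utterances):
            share = u.attribution.get(dim, 0.0) / total
            rewards[i] += dim_weights.get(dim, 0.0) * score * share
    return rewards


# Toy episode: made-up utterances, scores, and weights.
episode = [
    Utterance("Hi, how was your week?", {"goal": 0.0, "rapport": 1.0}),
    Utterance("Could we split the cost evenly?", {"goal": 3.0, "rapport": 1.0}),
    Utterance("Great, deal.", {"goal": 1.0, "rapport": 0.0}),
]
episode_scores = {"goal": 8.0, "rapport": 4.0}  # episode-level feedback
dim_weights = {"goal": 0.5, "rapport": 0.5}     # assumed mixing weights

rewards = utterance_rewards(episode, episode_scores, dim_weights)
print(rewards)
```

In this toy scheme the per-utterance rewards sum to the weighted episode-level score, so the first utterance still receives credit for rapport-building even though it contributes nothing directly to the goal dimension.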
