SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution

Abstract

Social intelligence, the ability to navigate complex interpersonal interactions, is a fundamental challenge for language agents. Training such agents with reinforcement learning requires solving the credit assignment problem: determining how each individual utterance contributes to the outcome of a multi-turn dialogue. Existing approaches use language models directly to distribute episode-level rewards, which yields attributions that are retrospective and lack theoretical grounding. We propose SAVOIR (ShApley Value fOr SocIal RL), a novel, principled framework grounded in cooperative game theory. Our approach combines two complementary principles. Expected utility shifts evaluation from retrospective attribution to prospective valuation, capturing an utterance's strategic potential for enabling favorable future trajectories. Shapley values ensure fair credit distribution with axiomatic guarantees of efficiency, symmetry, and marginality. Experiments on the SOTOPIA benchmark show that SAVOIR achieves new state-of-the-art performance across all evaluation settings, with our 7B model matching or exceeding proprietary models including GPT-4o and Claude-3.5-Sonnet. Notably, even large reasoning models consistently underperform, which suggests that social intelligence requires capabilities qualitatively different from analytical reasoning.
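To make the fair-credit idea concrete, the following is a minimal sketch of exact Shapley value computation over a toy set of utterances. It is an illustration under stated assumptions, not SAVOIR's actual estimator: the utterance names, the function `toy_value`, and its scores are hypothetical, and the abstract does not specify how the coalition value is obtained in practice. The sketch only shows that credits sum to the full-dialogue value (efficiency) and that an utterance with no marginal effect receives zero credit.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values: average each player's marginal
    contribution over every ordering of the players."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_value(coalition: FrozenSet[str]) -> float:
    """Hypothetical dialogue score for a subset of utterances
    (an assumption for illustration, not from the paper)."""
    score = 0.0
    if "greet" in coalition:
        score += 1.0
    if "offer" in coalition:
        score += 2.0
    if {"greet", "offer"} <= coalition:
        score += 3.0  # synergy: the offer lands better after a greeting
    return score  # "filler" never changes the score


utterances = ["greet", "offer", "filler"]
phi = shapley_values(utterances, toy_value)
total_credit = sum(phi.values())
full_value = toy_value(frozenset(utterances))
```

In this toy game the synergy bonus is split evenly between the two utterances that create it, and `total_credit` equals `full_value`. Exact enumeration grows factorially with the number of utterances, so any realistic dialogue would need a sampling-based approximation instead.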
