SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution
April 21, 2026
Authors: Xiachong Feng, Yi Jiang, Xiaocheng Feng, Deyi Yin, Libo Qin, Yangfan Ye, Lei Huang, Weitao Ma, Yuxuan Gu, Chonghan Qin, Bing Qin, Lingpeng Kong
cs.AI
Abstract
Social intelligence, the ability to navigate complex interpersonal interactions, presents a fundamental challenge for language agents. Training such agents via reinforcement learning requires solving the credit assignment problem: determining how individual utterances contribute to multi-turn dialogue outcomes. Existing approaches directly employ language models to distribute episode-level rewards, yielding attributions that are retrospective and lack theoretical grounding. We propose SAVOIR (ShApley Value fOr SocIal RL), a novel principled framework grounded in cooperative game theory. Our approach combines two complementary principles: expected utility, which shifts evaluation from retrospective attribution to prospective valuation, capturing an utterance's strategic potential for enabling favorable future trajectories; and Shapley values, which ensure fair credit distribution with axiomatic guarantees of efficiency, symmetry, and marginality. Experiments on the SOTOPIA benchmark demonstrate that SAVOIR achieves new state-of-the-art performance across all evaluation settings, with our 7B model matching or exceeding proprietary models including GPT-4o and Claude-3.5-Sonnet. Notably, even large reasoning models consistently underperform, suggesting that social intelligence requires capabilities qualitatively different from those underlying analytical reasoning.
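To make the Shapley-based credit assignment concrete, the sketch below computes exact Shapley values over a small set of dialogue turns. This is a generic illustration of the formula phi_i = sum over coalitions S (excluding i) of |S|!(n-|S|-1)!/n! * (v(S ∪ {i}) - v(S)), not SAVOIR's actual implementation; the value function `v` over turn subsets is a hypothetical stand-in for an episode-level reward.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley values by enumerating all coalitions.

    phi_i = sum over S ⊆ players \ {i} of
            |S|! (n-|S|-1)! / n! * (v(S ∪ {i}) - v(S))
    """
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for S in combinations(others, k):
                coalition = frozenset(S)
                # Marginal contribution of turn i given coalition S.
                total += weight * (v(coalition | {i}) - v(coalition))
        phi[i] = total
    return phi

# Hypothetical episode reward over subsets of dialogue turns: "t2" only
# pays off when "t1" is present, so marginal contributions are
# context-dependent, which is exactly what Shapley averaging handles.
def v(coalition):
    score = 1.0 if "t1" in coalition else 0.0
    if "t1" in coalition and "t2" in coalition:
        score += 2.0
    return score

phi = shapley_values(["t1", "t2", "t3"], v)
# Efficiency axiom: the turn-level credits sum to the full-episode reward,
# i.e. phi["t1"] + phi["t2"] + phi["t3"] == v({"t1", "t2", "t3"}).
```

Enumerating all 2^n coalitions is only feasible for short dialogues; in practice one would subsample coalitions (Monte Carlo Shapley estimation) when episodes are long.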