SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution
April 21, 2026
Authors: Xiachong Feng, Yi Jiang, Xiaocheng Feng, Deyi Yin, Libo Qin, Yangfan Ye, Lei Huang, Weitao Ma, Yuxuan Gu, Chonghan Qin, Bing Qin, Lingpeng Kong
cs.AI
Abstract
Social intelligence, the ability to navigate complex interpersonal interactions, presents a fundamental challenge for language agents. Training such agents via reinforcement learning requires solving the credit assignment problem: determining how individual utterances contribute to multi-turn dialogue outcomes. Existing approaches directly employ language models to distribute episode-level rewards, yielding attributions that are purely retrospective and lack theoretical grounding. We propose SAVOIR (ShApley Value fOr SocIal RL), a principled framework grounded in cooperative game theory. Our approach combines two complementary principles: expected utility shifts evaluation from retrospective attribution to prospective valuation, capturing an utterance's strategic potential for enabling favorable future trajectories, while Shapley values ensure fair credit distribution with axiomatic guarantees of efficiency, symmetry, and marginality. Experiments on the SOTOPIA benchmark demonstrate that SAVOIR achieves new state-of-the-art performance across all evaluation settings, with our 7B model matching or exceeding proprietary models including GPT-4o and Claude-3.5-Sonnet. Notably, even large reasoning models consistently underperform, suggesting that social intelligence requires qualitatively different capabilities from analytical reasoning.
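To make the Shapley-based credit assignment concrete, the sketch below computes exact Shapley values over the utterances of a toy two-turn episode. The characteristic function `v` and its reward numbers are hypothetical illustrations, not the paper's value function (which the abstract describes as a prospective expected-utility estimate); the sketch only demonstrates the attribution mechanism and its efficiency axiom.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal
    contribution over all orderings of the players (utterances)."""
    phi = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = frozenset(coalition | {p})
            phi[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: phi[p] / len(orderings) for p in players}

# Hypothetical characteristic function: episode-level reward earned
# by a coalition of utterances (illustrative numbers only).
def v(coalition):
    rewards = {
        frozenset(): 0.0,
        frozenset({"u1"}): 1.0,
        frozenset({"u2"}): 2.0,
        frozenset({"u1", "u2"}): 4.0,  # the two turns are synergistic
    }
    return rewards[coalition]

credits = shapley_values(["u1", "u2"], v)
# Efficiency axiom: per-utterance credits sum to the episode reward.
assert abs(sum(credits.values()) - v(frozenset({"u1", "u2"}))) < 1e-9
print(credits)  # {'u1': 1.5, 'u2': 2.5}
```

Exact enumeration scales as O(n!), so for long dialogues a Monte Carlo estimate over sampled orderings would replace the full loop; the axiomatic guarantees (efficiency, symmetry, marginality) are what motivate the construction.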