Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind
April 13, 2026
Authors: Hanqi Xiao, Vaidehi Patil, Zaid Khan, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
cs.AI
Abstract
As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., to form and use a theory of mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with the goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker's beliefs (ToM prompting). To close this gap, we use reinforcement learning to train models on ToM-SB to act as AI Double Agents, testing both fooling and ToM rewards. Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. Across four attackers of different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are strongly correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents trained with both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.
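The abstract does not specify how the fooling and ToM reward signals are combined during training. As a purely illustrative sketch (all function names, signatures, and the convex-weighting scheme below are assumptions, not the authors' implementation), a combined defender reward might look like the following:

```python
# Hypothetical sketch: combining a fooling reward with a ToM reward for the
# defender policy. Names and the weighting scheme are assumptions for
# illustration only; the paper's actual reward design is not given here.

def fooling_reward(attacker_believes_success: bool) -> float:
    """1.0 if the attacker ends the dialogue believing it extracted the secret."""
    return 1.0 if attacker_believes_success else 0.0

def tom_reward(predicted_belief: str, probed_belief: str) -> float:
    """1.0 if the defender's prediction of the attacker's belief matches a probe."""
    return 1.0 if predicted_belief == probed_belief else 0.0

def combined_reward(attacker_believes_success: bool,
                    predicted_belief: str,
                    probed_belief: str,
                    lam: float = 0.5) -> float:
    """Convex combination of the two signals (lam is an assumed hyperparameter)."""
    return ((1.0 - lam) * fooling_reward(attacker_believes_success)
            + lam * tom_reward(predicted_belief, probed_belief))

# Example: attacker was fooled, but the defender's belief prediction was wrong.
print(combined_reward(True, "secret_is_X", "secret_is_Y"))  # 0.5
```

Under this reading, setting lam to 0 or 1 would recover the fooling-only and ToM-only reward conditions that the abstract reports as each improving the other, consistent with the bidirectional emergence finding.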