

Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

April 13, 2026
作者: Hanqi Xiao, Vaidehi Patil, Zaid Khan, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
cs.AI

Abstract

As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker's beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. Across four attackers with different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are well-correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.
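The abstract describes reinforcement-learning defenders trained with a fooling reward, a ToM reward, or both combined. A minimal sketch of how such a combined scalar reward might be computed per episode; the function name, the linear weighting scheme, and the assumption that fooling success is binary while the ToM (belief-modeling) score lies in [0, 1] are all illustrative assumptions, not the paper's implementation:

```python
def combined_reward(fooling_success: bool,
                    tom_score: float,
                    w_fool: float = 0.5,
                    w_tom: float = 0.5) -> float:
    """Blend attacker-fooling success with a theory-of-mind score.

    fooling_success: whether the attacker was fooled this episode (binary).
    tom_score: quality of the defender's model of the attacker's beliefs, in [0, 1].
    Setting w_tom = 0 recovers a fooling-only reward; w_fool = 0 a ToM-only reward.
    """
    if not 0.0 <= tom_score <= 1.0:
        raise ValueError("tom_score must lie in [0, 1]")
    return w_fool * float(fooling_success) + w_tom * tom_score
```

Under this sketch, the paper's fooling-only and ToM-only ablations correspond to zeroing one weight, and the best-performing "AI Double Agent" uses both terms.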
April 15, 2026