Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

Abstract

As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker's beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. Across four attackers with different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are well-correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.
