Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

Abstract

As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker's beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. Across four attackers with different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are well-correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.
