針對可信監控器的適應性攻擊破壞了人工智慧控制協議
Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols
October 10, 2025
作者: Mikhail Terekhov, Alexander Panfilov, Daniil Dzenhaliou, Caglar Gulcehre, Maksym Andriushchenko, Ameya Prabhu, Jonas Geiping
cs.AI
摘要
AI控制協議作為一種防禦機制,旨在阻止不受信任的大型語言模型(LLM)代理在自主環境中造成危害。先前的研究將其視為安全問題,通過利用部署上下文來微妙地完成有害的側任務(如後門插入)進行壓力測試。實際上,大多數AI控制協議根本上依賴於LLM監控器,這可能成為一個關鍵的故障點。我們研究了由一個知曉協議和監控模型的不受信任模型發起的適應性攻擊,這種情況在不受信任的模型訓練時使用了更晚的知識截止點或能夠自主搜索此類信息時是合理的。我們實例化了一種簡單的適應性攻擊向量,攻擊者在模型輸出中嵌入公開已知或零樣本提示注入。採用此策略,前沿模型在兩個主要AI控制基準測試中持續避開多樣化的監控器並完成惡意任務。該攻擊對當前依賴監控器的協議普遍有效。此外,最近的「延遲至重採樣」協議甚至適得其反,其重採樣放大了提示注入,並有效地將其重新構建為最佳n次攻擊。總體而言,對監控模型的適應性攻擊代表了當前控制協議中的一個主要盲點,應成為未來AI控制機制評估的標準組成部分。
English
AI control protocols serve as a defense mechanism to stop untrusted LLM
agents from causing harm in autonomous settings. Prior work treats this as a
security problem, stress testing with exploits that use the deployment context
to subtly complete harmful side tasks, such as backdoor insertion. In practice,
most AI control protocols are fundamentally based on LLM monitors, which can
become a central point of failure. We study adaptive attacks by an untrusted
model that knows the protocol and the monitor model, which is plausible if the
untrusted model was trained with a later knowledge cutoff or can search for
this information autonomously. We instantiate a simple adaptive attack vector
by which the attacker embeds publicly known or zero-shot prompt injections in
the model outputs. Using this tactic, frontier models consistently evade
diverse monitors and complete malicious tasks on two main AI control
benchmarks. The attack works universally against current protocols that rely on
a monitor. Furthermore, the recent Defer-to-Resample protocol even backfires,
as its resampling amplifies the prompt injection and effectively reframes it as
a best-of-n attack. In general, adaptive attacks on monitor models represent
a major blind spot in current control protocols and should become a standard
component of evaluations for future AI control mechanisms.