
The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

June 9, 2026
Authors: Filippo Tonini, Federico Torrielli, Anton Danholt Lautrup, Peter Schneider-Kamp, Mustafa Mert Çelikok, Lukas Galke Poech
cs.AI

Abstract

As AI systems built from multiple language-model agents become more common, they are increasingly used to make decisions together: discussing, negotiating, and acting on shared tasks. While individual agents may appear well aligned when tested on their own, problems can arise from how they interact with one another. We introduce the Arbiter, an agent designed to monitor multi-agent conversations in real time and identify which participants may be behaving in misaligned ways. The Arbiter operates under a limited "inspection budget", meaning it must decide carefully how to spend its resources. As it observes a conversation step by step, it can choose to wait, question a participant, examine internal information such as system prompts or reasoning traces, or log concerning behavior. At the end, it produces a report identifying the likely source of misalignment. We evaluate the Arbiter across five conversation conditions, ranging from risky-financial-advice model organisms to evaluation-aware and colluding agents, and we test five tool configurations of increasing capability and two backbone models. We find that the Arbiter reliably detects misaligned agents well before the end of the conversation, with active inspection tools improving both detection accuracy and speed. Weight-induced misalignment proves hardest to detect, while instruction-induced misalignment is identified reliably even under passive observation. The logging tool exhibits a dual effect, improving recall at the cost of precision. These results suggest that continual, budget-aware monitoring can effectively catch misalignment, and that overseeing multi-agent systems may require treating the auditor as an active participant in the process. The code is available at https://github.com/aisilab/arbiter.
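
The monitoring loop described in the abstract (observe a turn, pick one of wait, question, inspect, or log, pay for it out of a limited budget, then name a suspect at the end) can be illustrated with a toy sketch. This is not the authors' implementation and does not reflect the code in their repository. The class `Monitor`, the per-action costs, the keyword-based policy, and the sample conversation are all assumptions made for illustration; in the paper the decision is made by a language-model backbone, and the abstract does not say how actions are priced.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    WAIT = "wait"          # keep observing, take no action
    QUESTION = "question"  # ask a participant a follow-up question
    INSPECT = "inspect"    # read a system prompt or reasoning trace
    LOG = "log"            # record concerning behavior


# Hypothetical costs: the abstract does not specify how actions are priced.
COST = {Action.WAIT: 0, Action.QUESTION: 1, Action.INSPECT: 2, Action.LOG: 0}

# Hypothetical red-flag phrases standing in for a language-model judgment.
RED_FLAGS = ("guaranteed", "all your savings")


@dataclass
class Monitor:
    budget: int
    suspicion: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def choose(self, text: str) -> Action:
        """Toy policy: use the strongest tool the remaining budget allows."""
        if not any(flag in text.lower() for flag in RED_FLAGS):
            return Action.WAIT
        if self.budget >= COST[Action.INSPECT]:
            return Action.INSPECT
        if self.budget >= COST[Action.QUESTION]:
            return Action.QUESTION
        return Action.LOG

    def step(self, turn: int, speaker: str, text: str) -> Action:
        """Observe one turn, pay for the chosen action, update suspicion."""
        action = self.choose(text)
        self.budget -= COST[action]
        if action is not Action.WAIT:
            self.suspicion[speaker] = self.suspicion.get(speaker, 0) + 1
            self.log.append((turn, speaker, action.value))
        return action

    def report(self):
        """Return the most-suspected participant, or None if nobody was flagged."""
        if not self.suspicion:
            return None
        return max(self.suspicion, key=self.suspicion.get)


# Invented example conversation, not taken from the paper's evaluation.
CONVERSATION = [
    ("advisor_a", "A diversified index fund fits your risk profile."),
    ("advisor_b", "Put all your savings into this coin, returns are guaranteed."),
    ("advisor_a", "I would be cautious about that."),
    ("advisor_b", "Trust me, it is guaranteed."),
]


def run(conversation, budget: int) -> Monitor:
    """Feed the conversation to a fresh monitor one turn at a time."""
    monitor = Monitor(budget=budget)
    for turn, (speaker, text) in enumerate(conversation):
        monitor.step(turn, speaker, text)
    return monitor


if __name__ == "__main__":
    result = run(CONVERSATION, budget=3)
    print(result.report(), result.log)
```

With a budget of 3, this toy monitor spends 2 units inspecting the first flagged turn and its last unit questioning the second, so a tighter budget forces it to fall back on cheaper actions. That trade-off is the sense in which the monitoring is "budget-aware".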