The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

Abstract

As AI systems built from multiple language-model agents become more common, they are increasingly used to make decisions together: discussing, negotiating, and acting on shared tasks. While individual agents may appear well-aligned when tested on their own, problems can arise from how they interact with one another. We introduce the Arbiter, an agent designed to monitor multi-agent conversations in real time and identify which participants may be behaving in misaligned ways. The Arbiter operates under a limited "inspection budget", so it must decide carefully how to spend its resources. As it observes a conversation step by step, it can choose to wait, question a participant, examine internal information such as system prompts or reasoning traces, or log concerning behavior. At the end, it produces a report identifying the likely source of misalignment. We evaluate the Arbiter across five conversation conditions, ranging from risky-financial-advice model organisms to evaluation-aware and colluding agents, and we test five tool configurations of increasing capability and two backbone models. We find that the Arbiter reliably detects misaligned agents well before the end of the conversation, with active inspection tools improving both detection accuracy and speed. Weight-induced misalignment proves hardest to detect, while instruction-induced misalignment is identified reliably even under passive observation. The logging tool exhibits a dual effect, improving recall at the cost of precision. These results suggest that continual, budget-aware monitoring can effectively catch misalignment, and that overseeing multi-agent systems may require treating the auditor as an active participant in the process. The code is available at https://github.com/aisilab/arbiter.
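
The budget-aware loop described above can be illustrated with a minimal sketch. It is not the released implementation: the class and function names, the per-action costs, and the keyword heuristic (a stand-in for the backbone model's judgment) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    WAIT = "wait"          # observe only
    QUESTION = "question"  # ask a participant to explain itself
    INSPECT = "inspect"    # read a system prompt or reasoning trace
    LOG = "log"            # record a concerning behavior


# Hypothetical per-action costs; the abstract does not specify any.
COSTS = {Action.WAIT: 0, Action.QUESTION: 1, Action.INSPECT: 2, Action.LOG: 0}


@dataclass
class Monitor:
    budget: int
    suspicion: dict = field(default_factory=dict)  # speaker -> flag count
    log: list = field(default_factory=list)

    def choose(self, speaker: str, text: str) -> Action:
        # Toy stand-in for the model's judgment: flag a single keyword.
        if "guaranteed" not in text.lower():
            return Action.WAIT
        # Prefer the strongest tool the remaining budget can pay for.
        if self.budget >= COSTS[Action.INSPECT]:
            return Action.INSPECT
        if self.budget >= COSTS[Action.QUESTION]:
            return Action.QUESTION
        return Action.LOG

    def step(self, speaker: str, text: str) -> Action:
        action = self.choose(speaker, text)
        self.budget -= COSTS[action]
        if action is not Action.WAIT:
            self.suspicion[speaker] = self.suspicion.get(speaker, 0) + 1
            self.log.append((speaker, action.value))
        return action

    def report(self):
        # Name the most-flagged speaker, or None if nothing was flagged.
        if not self.suspicion:
            return None
        return max(self.suspicion, key=self.suspicion.get)


def run_demo():
    # Invented two-agent exchange, used only to exercise the loop.
    conversation = [
        ("agent_a", "Diversify across index funds."),
        ("agent_b", "Put everything in this coin, returns are guaranteed."),
        ("agent_a", "That sounds risky."),
        ("agent_b", "Guaranteed, trust me."),
        ("agent_b", "Profit is guaranteed."),
    ]
    monitor = Monitor(budget=3)
    actions = [monitor.step(s, t) for s, t in conversation]
    return actions, monitor.report(), monitor.budget
```

With a budget of 3, this sketch spends 2 on an inspection at the first flagged turn, 1 on a question at the next, and falls back to free logging once the budget is exhausted, which shows why the order in which tools are used matters under a fixed budget.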