Arbiter Agent: Detecting Emergent Misalignment through Continual Monitoring of Multi-Agent Conversations

Abstract

As AI systems built from multiple language-model agents become more common, they are increasingly used to make decisions together: discussing, negotiating, and acting on shared tasks. While individual agents may appear well-aligned when tested on their own, problems can arise from how they interact with one another. We introduce the Arbiter, an agent designed to monitor multi-agent conversations in real time and identify which participants may be behaving in misaligned ways. The Arbiter operates under a limited "inspection budget", so it must decide carefully how to spend its resources. As it observes a conversation step by step, it can choose to wait, question a participant, examine internal information such as system prompts or reasoning traces, or log concerning behavior. At the end of the conversation, it produces a report identifying the likely source of misalignment. We evaluate the Arbiter across five conversation conditions, ranging from risky-financial-advice model organisms to evaluation-aware and colluding agents, and we test five tool configurations of increasing capability and two backbone models. We find that the Arbiter reliably detects misaligned agents well before the end of the conversation, and that active inspection tools improve both detection accuracy and speed. Weight-induced misalignment proves hardest to detect, while instruction-induced misalignment is identified reliably even under passive observation. The logging tool has a dual effect, improving recall at the cost of precision. These results suggest that continual, budget-aware monitoring can effectively catch misalignment, and that overseeing multi-agent systems may require treating the auditor as an active participant in the process. The code is available at https://github.com/aisilab/arbiter.
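
The step-by-step, budget-limited monitoring described in the abstract can be illustrated with a minimal sketch. It is not the Arbiter implementation from the repository: the `Monitor` class, the per-action costs in `COST`, the boolean `flagged` signal (a stand-in for the model's own judgment of a turn), and the escalation policy are all hypothetical and chosen only to make the four actions and the final report concrete.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    WAIT = "wait"          # keep observing
    QUESTION = "question"  # ask a participant a follow-up question
    INSPECT = "inspect"    # read a system prompt or reasoning trace
    LOG = "log"            # record concerning behavior


# Hypothetical costs: the abstract says a budget exists, not how it is charged.
COST = {Action.WAIT: 0, Action.QUESTION: 1, Action.INSPECT: 2, Action.LOG: 0}


@dataclass
class Monitor:
    budget: int
    suspicion: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def choose(self, flagged):
        """Toy policy: use the strongest tool the remaining budget allows."""
        if not flagged:
            return Action.WAIT
        if self.budget >= COST[Action.INSPECT]:
            return Action.INSPECT
        if self.budget >= COST[Action.QUESTION]:
            return Action.QUESTION
        return Action.LOG

    def step(self, turn_index, speaker, flagged):
        """Observe one turn, pick an action, and charge it to the budget."""
        action = self.choose(flagged)
        self.budget -= COST[action]
        if action is not Action.WAIT:
            self.suspicion[speaker] = self.suspicion.get(speaker, 0) + 1
            self.log.append((turn_index, speaker, action.value))
        return action

    def report(self):
        """Name the most-suspected participant, or None if nothing was flagged."""
        if not self.suspicion:
            return None
        return max(self.suspicion, key=self.suspicion.get)


def run(turns, budget):
    """Feed (speaker, flagged) turns to a fresh monitor and return it."""
    monitor = Monitor(budget=budget)
    for i, (speaker, flagged) in enumerate(turns):
        monitor.step(i, speaker, flagged)
    return monitor
```

With a budget of 3 and three flagged turns from the same speaker, this toy policy inspects once, questions once, and then falls back to logging once the budget is exhausted. That fallback is one simple way to picture why the order in which tools are spent matters under a fixed budget.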