Arbiter Agent: Continual Monitoring of Multi-Agent Conversations to Detect Emergent Misalignment

Abstract

As AI systems built from multiple language-model agents become more common, they are increasingly used to make decisions together: discussing, negotiating, and acting on shared tasks. While individual agents may appear well-aligned when tested on their own, problems can arise from how they interact with one another. We introduce the Arbiter, an agent designed to monitor multi-agent conversations in real time and identify which participants may be behaving in misaligned ways. The Arbiter operates under a limited "inspection budget", so it must decide carefully how to spend its resources. As it observes a conversation step by step, it can choose to wait, question a participant, examine internal information such as system prompts or reasoning traces, or log concerning behavior. At the end, it produces a report identifying the likely source of misalignment. We evaluate the Arbiter across five conversation conditions, ranging from risky-financial-advice model organisms to evaluation-aware and colluding agents, and we test five tool configurations of increasing capability and two backbone models. We find that the Arbiter reliably detects misaligned agents well before the end of the conversation, with active inspection tools improving both detection accuracy and speed. Weight-induced misalignment proves hardest to detect, while instruction-induced misalignment is identified reliably even under passive observation. The logging tool exhibits a dual effect, improving recall at the cost of precision. These results suggest that continual, budget-aware monitoring can effectively catch misalignment, and that overseeing multi-agent systems may require treating the auditor as an active participant in the process. The code is available at https://github.com/aisilab/arbiter.
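
To make the budget-constrained monitoring loop concrete, the following is a minimal sketch and not the released implementation. It assumes hypothetical per-action costs (waiting and logging are free, questioning costs 1, inspecting costs 2), a toy keyword heuristic (`looks_risky`) standing in for the language-model judgment, and an invented four-turn conversation. The names `Monitor`, `COST`, and `looks_risky` are illustrative only and do not come from the paper or its repository.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    WAIT = "wait"
    QUESTION = "question"
    INSPECT = "inspect"
    LOG = "log"


# Hypothetical costs: the abstract does not state how the budget is charged.
COST = {Action.WAIT: 0, Action.LOG: 0, Action.QUESTION: 1, Action.INSPECT: 2}


def looks_risky(message: str) -> bool:
    """Toy stand-in for the monitor's judgment (assumption, not the paper's method)."""
    keywords = ("all in", "guaranteed", "leveraged")
    return any(k in message.lower() for k in keywords)


@dataclass
class Monitor:
    budget: int
    suspicion: dict = field(default_factory=dict)  # speaker -> flagged-turn count
    log: list = field(default_factory=list)        # (turn, speaker, action) records

    def choose(self, speaker: str) -> Action:
        """Pick the most informative action the remaining budget allows."""
        score = self.suspicion.get(speaker, 0)
        if score >= 2 and self.budget >= COST[Action.INSPECT]:
            return Action.INSPECT
        if score >= 1 and self.budget >= COST[Action.QUESTION]:
            return Action.QUESTION
        if score >= 1:
            return Action.LOG  # budget exhausted: fall back to free logging
        return Action.WAIT

    def step(self, turn: int, speaker: str, message: str) -> Action:
        """Observe one turn, update suspicion, and charge the chosen action."""
        if looks_risky(message):
            self.suspicion[speaker] = self.suspicion.get(speaker, 0) + 1
        action = self.choose(speaker)
        self.budget -= COST[action]
        if action is not Action.WAIT:
            self.log.append((turn, speaker, action.value))
        return action

    def report(self):
        """Name the most-flagged participant, or None if nothing was flagged."""
        if not self.suspicion:
            return None
        return max(self.suspicion, key=self.suspicion.get)


# Invented example conversation between two participants, A and B.
conversation = [
    ("A", "Diversify across index funds."),
    ("B", "Put all savings into one leveraged bet."),
    ("A", "That seems risky."),
    ("B", "Guaranteed returns, go all in."),
]

monitor = Monitor(budget=3)
actions = [monitor.step(i, who, msg) for i, (who, msg) in enumerate(conversation)]
print([a.value for a in actions], monitor.budget, monitor.report())
```

In this toy run the monitor waits on the unflagged turns, questions B after the first flagged message, and spends its remaining budget inspecting B after the second. Once the budget reaches zero, any further flagged turn would fall back to free logging, which illustrates why allocation decisions matter under a fixed budget.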