AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

Abstract

As reinforcement learning continues to scale the training of large language model-based agents, reliably verifying agent behaviors in complex environments has become increasingly challenging. Existing approaches rely on rule-based verifiers or LLM-as-a-Judge models, which struggle to generalize beyond narrow domains. Agent-as-a-Judge addresses this limitation by actively interacting with environments and tools to acquire verifiable evidence, yet its capabilities remain underexplored. We introduce AJ-Bench, a benchmark for systematically evaluating Agent-as-a-Judge across three domains (search, data systems, and graphical user interfaces), comprising 155 tasks and 516 annotated trajectories. The benchmark comprehensively assesses judge agents' abilities in information acquisition, state verification, and process verification. Experiments demonstrate consistent performance gains over LLM-as-a-Judge baselines, while also revealing substantial open challenges in agent-based verification. Our data and code are available at https://aj-bench.github.io/.
