Counsel: A Meta-Evaluation Dataset for Agentic Tasks

Abstract

As agentic systems tackle increasingly complex multi-step tasks, evaluating their trajectories has become a major bottleneck: human annotation of a single trajectory on popular agentic benchmarks can take hours, which makes it difficult to scale evaluation for measuring performance or curating training data. This has driven widespread reliance on automated approaches such as LLM-as-a-judge (LLMJ) to critique agents at the process and outcome levels at scale. However, the soundness of LLMJ critiques often goes unmeasured. Here, we introduce Counsel, the first public dataset of meta-evaluations for agentic tasks. Counsel consists of process-level critiques from open-weight LLMJs on two agent benchmarks, tau-bench (customer support agents) and DA-Code (coding agents), together with human meta-evaluations of these critiques. For each flagged error, human annotators label the critique as "spot on", "correct location but poor reasoning", or "should not have flagged", achieving reliable inter-annotator agreement (Krippendorff's alpha of 0.78). The resulting dataset stratifies LLMJ critiques by human alignment on both error location within a trajectory and reasoning quality, providing data that can be used to calibrate, improve, or train LLMJs for agents. Comparing open-weight judges, we find that more capable judge models and more reasoning effort both improve human agreement, with the strongest judge reaching ~88% agreement on location and ~65% on reasoning. Counsel is generated using open-weight models and is permissively licensed for broad community use, which we hope will enable rigorous study and improved alignment of LLM-based evaluators for agentic systems.
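The three-way labeling scheme suggests one way the location and reasoning agreement figures could be derived from the human labels. The sketch below is an assumption, not the authors' stated method: it treats a critique as agreeing on location when it is labeled "spot on" or "correct location but poor reasoning", and as agreeing on reasoning only when it is labeled "spot on". The function name `agreement_rates` and the example labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter

# Label strings as given in the abstract.
SPOT_ON = "spot on"
POOR_REASONING = "correct location but poor reasoning"
SHOULD_NOT_FLAG = "should not have flagged"
VALID_LABELS = {SPOT_ON, POOR_REASONING, SHOULD_NOT_FLAG}


def agreement_rates(labels):
    """Return (location_agreement, reasoning_agreement) for a list of labels.

    Assumed definitions (not confirmed by the source):
      location agreement  = share of critiques whose flagged location is right
      reasoning agreement = share of critiques that are fully "spot on"
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    total = len(labels)
    location = (counts[SPOT_ON] + counts[POOR_REASONING]) / total
    reasoning = counts[SPOT_ON] / total
    return location, reasoning


if __name__ == "__main__":
    # Made-up labels for illustration only; not data from Counsel.
    example = [SPOT_ON] * 13 + [POOR_REASONING] * 5 + [SHOULD_NOT_FLAG] * 2
    loc, rea = agreement_rates(example)
    print(f"location agreement: {loc:.2f}, reasoning agreement: {rea:.2f}")
```

Under these assumed definitions, reasoning agreement can never exceed location agreement, which is consistent with the reported gap between the two figures. A different denominator (for example, reasoning agreement measured only over correctly located critiques) would give different numbers.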