Counsel: A Meta-Evaluation Dataset for Agentic Tasks
June 19, 2026
Authors: Sashank Pisupati, Henry Broomfield, Eujeong Choi, Antonia Calvi, Charlie Wang, Roman Engeler, Max Bartolo, Patrick Lewis
cs.AI
Abstract
As agentic systems tackle increasingly complex multi-step tasks, evaluating their trajectories has become a major bottleneck: human annotation of a single trajectory on popular agentic benchmarks can take hours, making it difficult to scale evaluations for measuring performance or curating training data. This has driven widespread reliance on automated approaches such as LLM-as-a-judge (LLMJ) to critique agents at the process and outcome levels at scale. However, the soundness of LLMJ critiques often goes unmeasured. Here, we introduce Counsel, the first public dataset of meta-evaluations for agentic tasks. Counsel consists of process-level critiques from open-weight LLMJs on two agent benchmarks, tau-bench (customer support agents) and DA-Code (coding agents), together with human meta-evaluations of these critiques. Human annotators label the critique of each flagged error as "spot on", "correct location but poor reasoning", or "should not have flagged", achieving reliable inter-annotator agreement (Krippendorff's alpha of 0.78). The resulting dataset stratifies LLMJ critiques by human alignment along two dimensions, error location within a trajectory and reasoning quality, and serves as a resource to calibrate, improve, or train LLMJs for agents. Comparing open-weight judges, we find that more capable judge models and more reasoning effort both improve agreement with human judgments, with the strongest judge reaching about 88% agreement on location and about 65% on reasoning. Counsel is generated using open-weight models and is permissively licensed for broad community use, which we hope will enable rigorous study and improved alignment of LLM-based evaluators for agentic systems.
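As a rough illustration of how the three annotation labels could be turned into the two agreement figures, the Python sketch below computes a location rate and a reasoning rate from a list of labels. The mapping is an assumption, not the authors' stated metric: it counts "spot on" and "correct location but poor reasoning" as agreement on location, and only "spot on" as agreement on reasoning. The function name `agreement_rates` and the example labels are hypothetical and are not drawn from the dataset.

```python
from collections import Counter

# The three labels named in the abstract.
SPOT_ON = "spot on"
POOR_REASONING = "correct location but poor reasoning"
SHOULD_NOT_FLAG = "should not have flagged"
LABELS = (SPOT_ON, POOR_REASONING, SHOULD_NOT_FLAG)


def agreement_rates(labels):
    """Return (location_agreement, reasoning_agreement) as fractions.

    Assumed mapping (not confirmed by the source):
      location agrees  -> SPOT_ON or POOR_REASONING
      reasoning agrees -> SPOT_ON only
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(labels)
    location = (counts[SPOT_ON] + counts[POOR_REASONING]) / total
    reasoning = counts[SPOT_ON] / total
    return location, reasoning


# Made-up example: 10 critiques of flagged errors.
example = [SPOT_ON] * 6 + [POOR_REASONING] * 2 + [SHOULD_NOT_FLAG] * 2
loc, reas = agreement_rates(example)
print(f"location agreement: {loc:.0%}, reasoning agreement: {reas:.0%}")
```

Under this assumed mapping the reasoning rate can never exceed the location rate, which matches the ordering of the two figures reported for the strongest judge.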