On Randomness in Agentic Evals

February 6, 2026
Authors: Bjarni Haukur Bjarnason, André Silva, Martin Monperrus
cs.AI

Abstract

Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks. Most papers report a pass@1 score computed from a single run per task, assuming this gives a reliable performance estimate. We test this assumption by collecting 60,000 agentic trajectories on SWE-Bench-Verified, spanning three models and two scaffolds. We find substantial variance: single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run is selected, with standard deviations exceeding 1.5 percentage points even at temperature 0. This variance has critical implications: reported improvements of 2 to 3 percentage points may reflect evaluation noise rather than genuine algorithmic progress. Through token-level analysis, we show that trajectories diverge early, often within the first few percent of tokens, and that these small differences cascade into different solution strategies. To enable reliable evaluation of agentic systems, we recommend three concrete practices: (1) estimate pass@1 from multiple independent runs per task, especially when measuring small improvements; (2) use statistical power analysis to determine the number of runs needed to detect expected effect sizes; and (3) consider metrics like pass@k (optimistic bound) and pass^k (pessimistic bound) with k > 1 to better characterize the full performance envelope. While these practices increase evaluation cost, they are essential for distinguishing genuine scientific progress from statistical noise.
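The estimators behind these metrics can be made concrete. The following minimal Python sketch assumes the standard combinatorial estimators (pass@k as in Chen et al.'s HumanEval evaluation; pass^k as the probability that all k runs sampled without replacement succeed); the abstract does not give the paper's exact formulas, so treat this as an illustration.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of P(at least one of k runs passes), sampling k
    # runs without replacement from n runs of which c passed the task.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(n: int, c: int, k: int) -> float:
    # Estimate of P(all k sampled runs pass): the pessimistic bound.
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# With n = 10 runs and c = 6 passes on a task: pass@1 = c/n = 0.60,
# pass@3 ≈ 0.97 (optimistic bound), pass^3 ≈ 0.17 (pessimistic bound).

Averaging pass@1 = c/n over multiple independent runs per task is exactly recommendation (1); the k > 1 variants bound the performance envelope from above and below.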
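For recommendation (2), a simple power analysis shows why single-run comparisons cannot resolve 2 to 3 point differences. The sketch below uses the standard normal-approximation sample-size formula for a two-proportion z-test; the baseline rates and thresholds are illustrative, not taken from the paper.

import math
from scipy.stats import norm

def runs_needed(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    # Evaluations per system required by a two-proportion z-test to
    # detect the difference p2 - p1 at the given significance and power.
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

print(runs_needed(0.50, 0.52))  # ≈ 9805: a 2-point gain over a 50% baseline
print(runs_needed(0.50, 0.60))  # ≈ 385: a 10-point gain needs far fewer

With roughly 500 tasks in SWE-Bench-Verified, a single run per task falls well short of the sample size needed to detect a 2-point improvement, which is why multiple runs per task are required.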