On Randomness in Agentic Evals
February 6, 2026
Authors: Bjarni Haukur Bjarnason, André Silva, Martin Monperrus
cs.AI
Abstract
Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks. Most papers report a pass@1 score computed from a single run per task, assuming this gives a reliable performance estimate. We test this assumption by collecting 60,000 agentic trajectories on SWE-Bench-Verified, spanning three models and two scaffolds. We find substantial variance: single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run is selected, with standard deviations exceeding 1.5 percentage points even at temperature 0. This variance has critical implications: reported improvements of 2-3 percentage points may reflect evaluation noise rather than genuine algorithmic progress. Through token-level analysis, we show that trajectories diverge early, often within the first few percent of tokens, and that these small differences cascade into different solution strategies. To enable reliable evaluation of agentic systems, we recommend three concrete practices: (1) estimate pass@1 from multiple independent runs per task, especially when measuring small improvements, (2) use statistical power analysis to determine the number of runs needed to detect expected effect sizes, and (3) consider metrics like pass@k (optimistic bound) and pass^k (pessimistic bound) with k>1 to better characterize the full performance envelope. While these practices increase evaluation cost, they are essential for distinguishing genuine scientific progress from statistical noise.
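To make recommendations (1) and (3) concrete, here is a minimal sketch of how pass@1, pass@k, and pass^k can be estimated from multiple runs per task. The pass@k estimator uses the standard unbiased combinatorial form from Chen et al. (2021); the pass^k estimator (all k sampled runs succeed) is its pessimistic counterpart. The task names and run outcomes below are hypothetical, and the paper's exact estimators may differ.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: P(at least one success among k runs drawn
    without replacement from n observed runs with c successes).
    Estimator from Chen et al. (2021)."""
    if n - c < k:
        return 1.0  # every k-subset must contain a success
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Pessimistic pass^k: P(all k runs drawn without replacement
    from n observed runs succeed)."""
    if c < k:
        return 0.0  # not enough successes for an all-success k-subset
    return math.comb(c, k) / math.comb(n, k)

# Hypothetical per-task run outcomes (True = task resolved), 5 runs each.
runs = {
    "task-001": [True, True, False, True, True],
    "task-002": [False, False, True, False, False],
    "task-003": [True, True, True, True, True],
}

k = 3
tasks = runs.values()
print("pass@1 :", sum(sum(r) / len(r) for r in tasks) / len(runs))
print(f"pass@{k}:", sum(pass_at_k(len(r), sum(r), k) for r in tasks) / len(runs))
print(f"pass^{k}:", sum(pass_hat_k(len(r), sum(r), k) for r in tasks) / len(runs))
```

Averaging the per-task estimates over all tasks gives the benchmark-level score; with n runs per task collected once, any k ≤ n can be reported from the same trajectories at no extra cost.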
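Recommendation (2), statistical power analysis, can be sketched with the standard two-proportion z-test sample-size formula: given a baseline pass@1 and the smallest improvement worth detecting, compute how many task-runs each system needs. The 40% baseline and 3-point effect below are illustrative assumptions, not numbers from the paper, and the formula treats task-runs as independent, ignoring within-task correlation, so real requirements are somewhat higher.

```python
import math
from scipy.stats import norm

def runs_needed(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Observations per system for a two-sided two-proportion z-test
    to detect the difference between success rates p1 and p2."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value at significance alpha
    z_beta = norm.ppf(power)           # quantile achieving the target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Illustrative: detect a 40% -> 43% improvement with 80% power.
n = runs_needed(0.40, 0.43)
print(n)                   # ~4,231 task-runs per system
print(math.ceil(n / 500))  # ~9 runs per task on SWE-Bench-Verified's 500 tasks
```

Under these assumptions, reliably detecting a 3-point improvement takes roughly nine runs per task, which makes the evaluation cost the abstract acknowledges explicit.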