Beyond Static Leaderboards: Predictive Validity in LLM Agent Evaluation

Abstract

Agent benchmarks are growing fast, but no single benchmark covers more than four or five of the dimensions that deployment exposes. This paper aggregates the largest coordinated deep dive into one MCP-based industrial-agent benchmark to date: fourteen parallel implementation studies covering new asset classes (including a multi-modal visual extension), alternative orchestrations, retrieval strategies, reasoning modes, infrastructure optimizations, and evaluation-methodology probes. Consolidating those studies with seven prior agent benchmarks, we argue that aggregate-score leaderboards systematically underspecify deployed-agent evaluation. Rankings derived from aggregate scores do not transfer to out-of-distribution settings; recent public-to-hidden competition retrospectives provide direct empirical evidence of this rank instability. We propose ranking configurations by predictive validity, defined as the correlation between in-sample and out-of-sample rank, rather than by in-sample mean, and we report a twelve-tier measurement apparatus that exposes the deployment-relevant dimensions that HELM and its agent-era successors collapse. The position is operationalized through three falsifiable out-of-distribution criteria with explicit thresholds; existing evidence partly supports it but is too thin to confirm it. We close with a pre-registered pilot design and a field-level vision for what the next generation of agentic benchmarks should report.
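
To make the proposed quantity concrete, the following is a minimal sketch of predictive validity computed as a rank correlation between in-sample and out-of-sample scores. The abstract does not name a specific correlation statistic, so the use of Spearman's rho here is an assumption; the function names and all scores are hypothetical and purely illustrative.

```python
from statistics import mean


def ranks(scores):
    """Rank scores in descending order (1 = best); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of configurations tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def predictive_validity(in_sample, out_of_sample):
    """Spearman's rho (assumed statistic): Pearson correlation of the two rank vectors."""
    rx, ry = ranks(in_sample), ranks(out_of_sample)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined when all scores tie")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for four agent configurations (not taken from the paper).
in_sample_scores = [0.82, 0.79, 0.75, 0.70]
out_of_sample_scores = [0.55, 0.61, 0.58, 0.40]
rho = predictive_validity(in_sample_scores, out_of_sample_scores)
```

In this toy case the in-sample leader drops to third place out of sample, so the rank correlation is well below 1 even though the in-sample means are close. That is the kind of instability an aggregate-score leaderboard would hide.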