Automating Benchmark Design
October 28, 2025
Authors: Amanda Dsouza, Harit Vishwakarma, Zhengyang Qi, Justin Bauer, Derek Pham, Thomas Walshe, Armin Parchami, Frederic Sala, Paroma Varma
cs.AI
Abstract
The rapid progress and widespread deployment of LLMs and LLM-powered agents
have outpaced our ability to evaluate them. Hand-crafted, static benchmarks are
the primary tool for assessing model capabilities, but these quickly become
saturated. In contrast, dynamic benchmarks evolve alongside the models they
evaluate, but are expensive to create and continuously update. To address these
challenges, we develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a
framework that leverages environment design principles to automate the process
of dynamic benchmark design. BeTaL works by parameterizing key design choices
in base benchmark templates and uses LLMs to reason through the resulting
parameter space to obtain target properties (such as difficulty and realism) in
a cost-efficient manner. We validate this approach by testing its ability to
create benchmarks with desired difficulty levels. Using BeTaL, we create two new
benchmarks and extend tau-bench, a popular agentic benchmark. Extensive
evaluation on these three tasks and multiple target difficulty levels shows
that BeTaL produces benchmarks much closer to the desired difficulty, with
average deviations ranging from 5.3% to 13.2% -- a 2-4x improvement over the
baselines.
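
The abstract describes an LLM-in-the-loop tuning loop: key design choices of a benchmark template are exposed as parameters, the resulting benchmark's difficulty is measured, and an LLM reasons over the gap to the target to propose new parameter values. The sketch below illustrates that control loop only; the function names (propose_parameters, measure_difficulty), the accept/iterate criterion, and the numeric parameter encoding are assumptions for illustration, not the paper's actual BeTaL implementation.

```python
# Illustrative sketch of an LLM-in-the-loop benchmark tuning loop.
# Assumed structure for exposition; not the paper's BeTaL code.

from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class TuningResult:
    params: Dict[str, float]      # design-choice parameters of the benchmark template
    measured_difficulty: float    # e.g. 1 - pass rate of a reference agent


def tune_benchmark(
    propose_parameters: Callable[[Dict[str, float], float, float], Dict[str, float]],
    measure_difficulty: Callable[[Dict[str, float]], float],
    initial_params: Dict[str, float],
    target_difficulty: float,
    tolerance: float = 0.05,
    max_rounds: int = 10,
) -> TuningResult:
    """Iterate until the measured difficulty is within `tolerance` of the target."""
    params = initial_params
    difficulty = measure_difficulty(params)
    for _ in range(max_rounds):
        if abs(difficulty - target_difficulty) <= tolerance:
            break
        # The LLM sees the current parameters and the gap to the target,
        # and reasons about which design choices to adjust next.
        params = propose_parameters(params, difficulty, target_difficulty)
        difficulty = measure_difficulty(params)
    return TuningResult(params=params, measured_difficulty=difficulty)
```

In this reading, propose_parameters would wrap a prompt to the tuning LLM, and measure_difficulty would instantiate the benchmark from the parameters and run reference models on it; both are hypothetical stand-ins for whatever interfaces the paper actually defines.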