Automating Benchmark Design

October 28, 2025
Authors: Amanda Dsouza, Harit Vishwakarma, Zhengyang Qi, Justin Bauer, Derek Pham, Thomas Walshe, Armin Parchami, Frederic Sala, Paroma Varma
cs.AI

Abstract

The rapid progress and widespread deployment of LLMs and LLM-powered agents have outpaced our ability to evaluate them. Hand-crafted, static benchmarks are the primary tool for assessing model capabilities, but they quickly become saturated. Dynamic benchmarks, in contrast, evolve alongside the models they evaluate, but are expensive to create and continuously update. To address these challenges, we develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that leverages environment-design principles to automate dynamic benchmark design. BeTaL parameterizes key design choices in base benchmark templates and uses LLMs to reason through the resulting parameter space, obtaining target properties (such as difficulty and realism) in a cost-efficient manner. We validate the approach on its ability to create benchmarks with desired difficulty levels: using BeTaL, we create two new benchmarks and extend the popular agentic benchmark tau-bench. Extensive evaluation on these three tasks across multiple target difficulty levels shows that BeTaL produces benchmarks much closer to the desired difficulty, with average deviations ranging from 5.3% to 13.2% -- a 2-4x improvement over the baselines.
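To make the tuning loop concrete, the sketch below shows one way an LLM-in-the-loop parameter search of this kind could be organized. It is an illustrative assumption based only on the abstract, not the paper's actual implementation; the function names (`tune_benchmark`, `measure_difficulty`, `llm_propose`) and the loop structure are hypothetical.

```python
# Illustrative sketch (assumed structure, not BeTaL's actual implementation).
# A benchmark template exposes a few design parameters; an LLM proposes new
# parameter values based on the gap between measured and target difficulty,
# and the loop repeats until the benchmark is close enough to the target.

import json
from typing import Callable, Dict


def tune_benchmark(
    template_params: Dict[str, float],
    target_difficulty: float,            # e.g. desired error rate of a reference model
    measure_difficulty: Callable[[Dict[str, float]], float],  # runs the benchmark, returns observed difficulty
    llm_propose: Callable[[str], str],   # LLM call: prompt -> JSON string of updated parameters
    max_rounds: int = 5,
    tolerance: float = 0.05,
) -> Dict[str, float]:
    """Iteratively adjust benchmark parameters toward a target difficulty."""
    params = dict(template_params)
    for _ in range(max_rounds):
        observed = measure_difficulty(params)
        if abs(observed - target_difficulty) <= tolerance:
            break  # measured difficulty is close enough to the target
        prompt = (
            "You are tuning a benchmark. Current parameters: "
            f"{json.dumps(params)}. Observed difficulty: {observed:.2f}. "
            f"Target difficulty: {target_difficulty:.2f}. "
            "Return updated parameters as JSON with the same keys."
        )
        params = json.loads(llm_propose(prompt))
    return params
```

In such a setup, `measure_difficulty` would run the parameterized benchmark against one or more reference models and report, for example, their failure rate, while `llm_propose` would wrap an actual LLM API call; both are left as plain callables here so the sketch stays self-contained.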