A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

May 27, 2026
Authors: Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz, Michal Shmueli-Scheuer, Roi Reichert
cs.AI

Abstract

As agent capabilities advance, existing benchmarks, such as τ²-Bench, are becoming increasingly saturated. Yet constructing new benchmark tasks remains complex, costly, and labor-intensive. Moreover, the standard approach, in which scenarios are first written in natural language and then mapped to tool sequences, captures only a narrow subset of the tool-use patterns agents exercise. In this paper, we address these problems by reversing the task construction process. We propose TASTE: Task Synthesis from Tool Sequence Evolution, an automatic method that generates challenging tasks with broader tool-use coverage. TASTE utilizes an Adaptive Contrastive n-gram model trained on LLM-judged validity signals. This enables sampling valid tool sequences that cover a vast range of tool combinations. TASTE then selects representative sequences from the pool via clustering, instantiates them into complete benchmark tasks, and refines them through iterative difficulty evolution. Using TASTE, we construct τ^c-Bench, a challenging extension of the three domains of τ²-Bench. We evaluate 11 agent/user LLM pairs and find that models nearly saturating τ²-Bench suffer severe performance drops on our tasks (e.g., Gemini-3-Flash falls from 0.82–0.94 to 0.28–0.61). Beyond increasing difficulty, our generated tasks more than double the number of unique tool combinations agents must execute. Our results suggest high scores on existing benchmarks often reflect saturation rather than robust task-solving ability. By automating the generation of difficult, high-coverage benchmarks, TASTE enables continuous, scalable evaluation of future agents.
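
To make the idea of sampling tool sequences from a contrastive n-gram model more concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify the model, so the bigram order, the smoothed valid/invalid count ratio, the `alpha` parameter, and every tool name below are assumptions chosen for illustration. The adaptive updates from an LLM judge, the clustering step, and task instantiation are all omitted.

```python
import random
from collections import Counter

START, END = "<s>", "</s>"


def bigram_counts(sequences):
    """Count adjacent tool pairs, including start and end markers."""
    counts = Counter()
    for seq in sequences:
        padded = [START, *seq, END]
        counts.update(zip(padded, padded[1:]))
    return counts


def fit(valid, invalid):
    """Assumed model: one bigram table per validity label."""
    return bigram_counts(valid), bigram_counts(invalid)


def next_weights(model, prev, vocab, alpha=1.0):
    """Weight each candidate by its smoothed valid/invalid count ratio."""
    pos, neg = model
    return {
        tool: (pos[(prev, tool)] + alpha) / (neg[(prev, tool)] + alpha)
        for tool in vocab
    }


def sample_sequence(model, tools, rng, max_len=8, alpha=1.0):
    """Sample one tool sequence, stopping at END or at max_len."""
    vocab = [*tools, END]
    seq, prev = [], START
    while len(seq) < max_len:
        weights = next_weights(model, prev, vocab, alpha)
        nxt = rng.choices(vocab, weights=[weights[t] for t in vocab])[0]
        if nxt == END:
            break
        seq.append(nxt)
        prev = nxt
    return seq


# Hypothetical tool names and hypothetical judged sequences.
tools = ["get_user", "find_order", "cancel_order", "refund_order"]
valid = [
    ["get_user", "find_order", "cancel_order"],
    ["get_user", "find_order", "refund_order"],
]
invalid = [
    ["cancel_order", "get_user"],
    ["refund_order", "find_order"],
]

model = fit(valid, invalid)
print(sample_sequence(model, tools, random.Random(0)))
```

In this sketch, transitions seen mostly in sequences judged valid receive a weight above 1 and transitions seen mostly in invalid ones receive a weight below 1, so sampling drifts toward plausible orderings while smoothing keeps unseen tool combinations reachable.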