A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks
May 27, 2026
Authors: Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz, Michal Shmueli-Scheuer, Roi Reichert
cs.AI
Abstract
As agent capabilities advance, existing benchmarks, such as τ²-Bench, are becoming increasingly saturated. Yet constructing new benchmark tasks remains complex, costly, and labor-intensive. Moreover, the standard approach, in which scenarios are first written in natural language and then mapped to tool sequences, captures only a narrow subset of the tool-use patterns agents exercise. In this paper, we address these problems by reversing the task construction process. We propose TASTE: Task Synthesis from Tool Sequence Evolution, an automatic method that generates challenging tasks with broader tool-use coverage. TASTE utilizes an Adaptive Contrastive n-gram model trained on LLM-judged validity signals. This enables sampling valid tool sequences that cover a vast range of tool combinations. TASTE then selects representative sequences from the pool via clustering, instantiates them into complete benchmark tasks, and refines them through iterative difficulty evolution. Using TASTE, we construct τ^c-Bench, a challenging extension of the three domains of τ²-Bench. We evaluate 11 agent/user LLM pairs and find that models nearly saturating τ²-Bench suffer severe performance drops on our tasks (e.g., Gemini-3-Flash falls from 0.82–0.94 to 0.28–0.61). Beyond increasing difficulty, our generated tasks more than double the number of unique tool combinations agents must execute. Our results suggest high scores on existing benchmarks often reflect saturation rather than robust task-solving ability. By automating the generation of difficult, high-coverage benchmarks, TASTE enables continuous, scalable evaluation of future agents.
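To make the sampling idea concrete, here is a minimal sketch of drawing tool sequences from a bigram model that contrasts sequences judged valid against sequences judged invalid. It is not the paper's Adaptive Contrastive n-gram model, whose details the abstract does not give: the scoring rule (valid count minus a penalized invalid count), the `penalty` parameter, the tool names, and the hard-coded labels standing in for LLM judgments are all assumptions made for illustration.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary markers


def train_contrastive_bigram(sequences, labels):
    """Count bigrams separately for sequences labeled valid and invalid."""
    pos = defaultdict(lambda: defaultdict(int))
    neg = defaultdict(lambda: defaultdict(int))
    for seq, is_valid in zip(sequences, labels):
        table = pos if is_valid else neg
        padded = [START, *seq, END]
        for prev, nxt in zip(padded, padded[1:]):
            table[prev][nxt] += 1
    return pos, neg


def next_tool_weights(pos, neg, prev, penalty=1.0):
    """Assumed scoring rule: valid count minus penalized invalid count; keep positives only."""
    weights = {}
    for nxt, count in pos[prev].items():
        score = count - penalty * neg[prev].get(nxt, 0)
        if score > 0:
            weights[nxt] = score
    return weights


def sample_sequence(pos, neg, rng, max_len=8, penalty=1.0):
    """Sample one tool sequence, stopping at END, at max_len, or when no transition survives."""
    seq, prev = [], START
    while len(seq) < max_len:
        weights = next_tool_weights(pos, neg, prev, penalty)
        if not weights:
            break
        tools = sorted(weights)  # fixed order keeps sampling reproducible for a given seed
        nxt = rng.choices(tools, weights=[weights[t] for t in tools])[0]
        if nxt == END:
            break
        seq.append(nxt)
        prev = nxt
    return seq


def unique_bigrams(sequences):
    """A simple coverage proxy: the set of distinct adjacent tool pairs."""
    return {(a, b) for seq in sequences for a, b in zip(seq, seq[1:])}


# Hypothetical tool names; labels stand in for an LLM judge's validity verdicts.
SEQS = [
    ["get_user", "get_order", "cancel_order"],
    ["get_user", "get_order", "refund"],
    ["cancel_order", "get_user"],
]
LABELS = [True, True, False]

pos, neg = train_contrastive_bigram(SEQS, LABELS)
rng = random.Random(0)
samples = [sample_sequence(pos, neg, rng) for _ in range(5)]
print(samples)
print(len(unique_bigrams(samples)))
```

Counting distinct adjacent tool pairs is only one possible proxy for the "unique tool combinations" the abstract mentions. The clustering, task instantiation, and difficulty-evolution stages are not sketched here.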