A Matter of TASTE: Improving the Coverage and Difficulty of Agent Benchmarks

Abstract

As agent capabilities advance, existing benchmarks such as τ²-Bench are becoming increasingly saturated. Yet constructing new benchmark tasks remains complex, costly, and labor-intensive. Moreover, the standard approach, in which scenarios are first written in natural language and then mapped to tool sequences, captures only a narrow subset of the tool-use patterns that agents exercise. In this paper, we address these problems by reversing the task construction process. We propose TASTE (Task Synthesis from Tool Sequence Evolution), an automatic method that generates challenging tasks with broader tool-use coverage. TASTE uses an Adaptive Contrastive n-gram model trained on LLM-judged validity signals, which enables sampling valid tool sequences that cover a wide range of tool combinations. TASTE then selects representative sequences from the pool via clustering, instantiates them into complete benchmark tasks, and refines them through iterative difficulty evolution. Using TASTE, we construct τ^c-Bench, a challenging extension of the three domains of τ²-Bench. We evaluate 11 agent/user LLM pairs and find that models nearly saturating τ²-Bench suffer severe performance drops on our tasks (e.g., Gemini-3-Flash falls from 0.82–0.94 to 0.28–0.61). Beyond increasing difficulty, our generated tasks more than double the number of unique tool combinations that agents must execute. Our results suggest that high scores on existing benchmarks often reflect saturation rather than robust task-solving ability. By automating the generation of difficult, high-coverage benchmarks, TASTE enables continuous, scalable evaluation of future agents.
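To make the idea of sampling tool sequences from a contrastive n-gram model more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a bigram (n = 2) model, a likelihood-ratio weighting between sequences judged valid and invalid, and add-alpha smoothing; the tool names, the function names, and the `alpha` parameter are all hypothetical, and the adaptive component, clustering, and difficulty evolution described in the abstract are not modeled here.

```python
import random
from collections import Counter

START, END = "<s>", "</s>"  # hypothetical boundary markers


def bigram_counts(sequences):
    """Count adjacent (previous, next) tool pairs, including boundaries."""
    counts = Counter()
    for seq in sequences:
        padded = [START, *seq, END]
        counts.update(zip(padded, padded[1:]))
    return counts


def contrastive_weights(valid, invalid, vocab, alpha=1.0):
    """Weight each transition by p_valid(next | prev) / p_invalid(next | prev).

    This likelihood ratio is one simple way to contrast the two sets;
    it is an assumption of this sketch, not a detail given in the abstract.
    """
    pos, neg = bigram_counts(valid), bigram_counts(invalid)
    targets = sorted(vocab | {END})
    weights = {}
    for prev in sorted(vocab | {START}):
        # Add-alpha smoothing keeps unseen transitions at a nonzero probability.
        pos_total = sum(pos[(prev, t)] for t in targets) + alpha * len(targets)
        neg_total = sum(neg[(prev, t)] for t in targets) + alpha * len(targets)
        for nxt in targets:
            p_valid = (pos[(prev, nxt)] + alpha) / pos_total
            p_invalid = (neg[(prev, nxt)] + alpha) / neg_total
            weights[(prev, nxt)] = p_valid / p_invalid
    return weights


def sample_sequence(weights, vocab, rng, max_len=6):
    """Sample one tool sequence, choosing each step in proportion to its weight."""
    targets = sorted(vocab | {END})
    seq, prev = [], START
    while len(seq) < max_len:
        step_weights = [weights[(prev, t)] for t in targets]
        nxt = rng.choices(targets, weights=step_weights, k=1)[0]
        if nxt == END:
            break
        seq.append(nxt)
        prev = nxt
    return seq


# Toy data: labels stand in for LLM-judged validity signals.
vocab = {"get_user", "get_order", "cancel_order", "refund_order"}
valid = [
    ["get_user", "get_order", "cancel_order"],
    ["get_user", "get_order", "refund_order"],
]
invalid = [
    ["cancel_order", "get_user"],
    ["refund_order", "get_order"],
]

weights = contrastive_weights(valid, invalid, vocab)
print(sample_sequence(weights, vocab, random.Random(0)))
```

In this toy setup, transitions that appear mostly in valid sequences (such as starting with `get_user`) receive weights above 1, while transitions seen mostly in invalid sequences are down-weighted, so sampled sequences lean toward valid patterns while still reaching unseen tool combinations through smoothing.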