The Problem of TASTE: Improving the Coverage and Difficulty of Agent Benchmarks

Abstract

As agent capabilities advance, existing benchmarks, such as τ^2-Bench, are becoming increasingly saturated. Yet constructing new benchmark tasks remains complex, costly, and labor-intensive. Moreover, the standard approach, in which scenarios are first written in natural language and then mapped to tool sequences, captures only a narrow subset of the tool-use patterns agents exercise. In this paper, we address these problems by reversing the task construction process. We propose TASTE: Task Synthesis from Tool Sequence Evolution, an automatic method that generates challenging tasks with broader tool-use coverage. TASTE utilizes an Adaptive Contrastive n-gram model trained on LLM-judged validity signals. This enables sampling valid tool sequences that cover a vast range of tool combinations. TASTE then selects representative sequences from the pool via clustering, instantiates them into complete benchmark tasks, and refines them through iterative difficulty evolution. Using TASTE, we construct τ^c-Bench, a challenging extension of the three domains of τ^2-Bench. We evaluate 11 agent/user LLM pairs and find that models nearly saturating τ^2-Bench suffer severe performance drops on our tasks (e.g., Gemini-3-Flash falls from 0.82-0.94 to 0.28-0.61). Beyond increasing difficulty, our generated tasks more than double the number of unique tool combinations agents must execute. Our results suggest high scores on existing benchmarks often reflect saturation rather than robust task-solving ability. By automating the generation of difficult, high-coverage benchmarks, TASTE enables continuous, scalable evaluation of future agents.
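The abstract does not spell out how the Adaptive Contrastive n-gram model is defined, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a bigram model (n = 2) that keeps separate transition counts for sequences judged valid and invalid, scores each candidate next tool by the log-ratio of the two smoothed probabilities, and samples in proportion to that ratio. The class name, the tool names, the smoothing constant `alpha`, and the definition of a "tool combination" as the unordered set of tools in a sequence are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"


class ContrastiveBigram:
    """Toy contrastive bigram model over tool names (illustrative assumption)."""

    def __init__(self, vocab, alpha=1.0):
        self.vocab = list(vocab) + [END]      # END lets sampling terminate
        self.alpha = alpha                    # add-alpha smoothing constant
        self.valid = defaultdict(Counter)     # counts from sequences judged valid
        self.invalid = defaultdict(Counter)   # counts from sequences judged invalid

    def update(self, seq, is_valid):
        """Add one judged sequence to the matching count table."""
        table = self.valid if is_valid else self.invalid
        prev = START
        for tool in list(seq) + [END]:
            table[prev][tool] += 1
            prev = tool

    def _logp(self, table, prev, tool):
        total = sum(table[prev].values()) + self.alpha * len(self.vocab)
        return math.log((table[prev][tool] + self.alpha) / total)

    def score(self, prev, tool):
        """Contrastive score: log P_valid(tool | prev) - log P_invalid(tool | prev)."""
        return self._logp(self.valid, prev, tool) - self._logp(self.invalid, prev, tool)

    def sample(self, rng, max_len=6):
        """Sample a tool sequence with weights proportional to exp(score)."""
        seq, prev = [], START
        for _ in range(max_len):
            weights = [math.exp(self.score(prev, t)) for t in self.vocab]
            tool = rng.choices(self.vocab, weights=weights, k=1)[0]
            if tool == END:
                break
            seq.append(tool)
            prev = tool
        return seq


def unique_tool_combinations(sequences):
    """Coverage proxy (assumption): distinct unordered sets of tools used."""
    return {frozenset(s) for s in sequences}


# Hypothetical tool names; the validity labels stand in for LLM judgments.
TOOLS = ["find_user", "get_order", "cancel_order", "refund"]
model = ContrastiveBigram(TOOLS)
model.update(["find_user", "get_order", "cancel_order"], is_valid=True)
model.update(["find_user", "get_order", "refund"], is_valid=True)
model.update(["cancel_order", "find_user"], is_valid=False)
model.update(["refund", "refund"], is_valid=False)

rng = random.Random(0)
pool = [model.sample(rng) for _ in range(200)]
coverage = unique_tool_combinations(pool)
```

In this sketch, transitions seen mostly in valid sequences receive a positive score and are sampled more often, while transitions seen mostly in invalid sequences are suppressed but never ruled out entirely because of the smoothing. The clustering, task instantiation, and difficulty-evolution stages described in the abstract are not modeled here.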