AsyncTool: Evaluating Asynchronous Function Calling Capability in Multi-Task Scenarios

Abstract

Large language model (LLM)-based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing evaluations often overlook the temporal dimension of tool use, especially the impact of tool response latency, and are usually limited to single-task settings. In real-world applications, multiple tasks often need to run concurrently, and overall efficiency depends on whether an agent can make use of the idle time it spends waiting for tool responses. We refer to this capability as asynchronous tool calling. To evaluate it, we propose AsyncTool, a benchmark for assessing LLM-based agents in interactive multi-task tool-use environments with delayed tool feedback. AsyncTool presents multiple heterogeneous tasks simultaneously and simulates realistic tool response latency during execution. Using a hybrid data evolution strategy, we construct a diverse asynchronous multi-task dataset that covers multiple scenarios and tool-use patterns. We evaluate models at the step, sub-task, and task levels, and introduce efficiency-oriented metrics that measure task coordination and completion efficiency. Extensive experiments show that delayed tool feedback poses substantial challenges to current agents and leads to clear performance degradation. Models that better coordinate task switching, dependency tracking, and state maintenance achieve stronger performance on AsyncTool. Our analysis identifies key failure modes of current tool-using agents and provides practical insights for designing future systems with stronger temporal reasoning and coordination capabilities.
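To make the notion of asynchronous tool calling concrete, the following is a minimal sketch in Python, not the benchmark's implementation. It assumes a hypothetical `call_tool` coroutine whose latency is simulated with `asyncio.sleep`, and a hypothetical `TASKS` table of per-step latencies; none of these names come from AsyncTool. Steps within a task are treated as dependent and stay sequential, while independent tasks are interleaved so that waiting time on one task is spent advancing another.

```python
import asyncio
import time
from typing import Dict, List

# Hypothetical tool call: the latency is simulated, not taken from AsyncTool.
async def call_tool(name: str, latency: float) -> str:
    await asyncio.sleep(latency)  # stands in for waiting on a slow tool response
    return f"{name}:done"

async def run_task(task_id: str, latencies: List[float]) -> List[str]:
    """Run one task; its steps depend on each other, so they run in order."""
    results = []
    for step, latency in enumerate(latencies):
        results.append(await call_tool(f"{task_id}/step{step}", latency))
    return results

async def run_sequential(tasks: Dict[str, List[float]]) -> List[List[str]]:
    # Single-task style: finish each task before starting the next one.
    return [await run_task(tid, lats) for tid, lats in tasks.items()]

async def run_interleaved(tasks: Dict[str, List[float]]) -> List[List[str]]:
    # Asynchronous style: while one task waits on a tool, others make progress.
    return await asyncio.gather(*(run_task(tid, lats) for tid, lats in tasks.items()))

def timed(runner, tasks):
    """Return (results, elapsed seconds) for one scheduling strategy."""
    start = time.perf_counter()
    results = asyncio.run(runner(tasks))
    return results, time.perf_counter() - start

# Hypothetical workload: three tasks, each needing about 0.1 s of tool waiting.
TASKS = {"A": [0.05, 0.05], "B": [0.1], "C": [0.02, 0.03, 0.05]}

if __name__ == "__main__":
    for runner in (run_sequential, run_interleaved):
        _, elapsed = timed(runner, TASKS)
        print(f"{runner.__name__}: {elapsed:.2f}s")
```

Both strategies produce the same results; only the wall-clock time differs, because the interleaved run overlaps the waiting periods of independent tasks. An LLM agent faces a harder version of this problem, since it must also decide when to switch tasks and keep track of each task's state and dependencies.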