AsyncTool: Evaluating Asynchronous Function Calling in Multi-Task Scenarios

Abstract

Large language model (LLM)-based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing evaluations often overlook the temporal dimension of tool use, especially the impact of tool response latency, and are usually limited to single-task settings. In real-world applications, multiple tasks often need to run concurrently, and overall efficiency depends on whether an agent can make effective use of idle time while waiting for tool responses. We refer to this capability as asynchronous tool calling. To evaluate it, we propose AsyncTool, a benchmark for assessing LLM-based agents in interactive multi-task tool-use environments with delayed tool feedback. AsyncTool presents multiple heterogeneous tasks simultaneously and simulates realistic tool response latency during execution. Using a hybrid data evolution strategy, we construct a diverse asynchronous multi-task dataset that covers a range of scenarios and tool-use patterns. We evaluate models at the step, sub-task, and task levels, and introduce efficiency-oriented metrics that measure task coordination and completion efficiency. Extensive experiments show that delayed tool feedback poses substantial challenges to current agents and leads to clear performance degradation. Models that better coordinate task switching, dependency tracking, and state maintenance achieve stronger performance on AsyncTool. Our analysis identifies key failure modes of current tool-using agents and provides practical insights for designing future systems with stronger temporal reasoning and coordination capabilities.
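
To make the notion of asynchronous tool calling concrete, the following is a minimal sketch rather than the benchmark's actual harness. It assumes hypothetical tools (`search_flights`, `book_flight`, `query_db`) whose latency is simulated with `asyncio.sleep`, and it compares finishing two tasks one after the other against overlapping their waiting time. All names, task contents, and latency values are illustrative assumptions. In AsyncTool the agent itself has to decide when to switch tasks; here `asyncio.gather` stands in for that decision to show only the effect of the overlap.

```python
import asyncio
import time


async def call_tool(name: str, latency: float) -> str:
    """Hypothetical tool stub: sleeps to simulate a delayed tool response."""
    await asyncio.sleep(latency)
    return f"{name}:done"


async def run_task(task_id: str, tools: list[tuple[str, float]]) -> list[str]:
    """Steps inside one task depend on each other, so they run in order."""
    results = []
    for name, latency in tools:
        results.append(await call_tool(f"{task_id}/{name}", latency))
    return results


# Two heterogeneous tasks with made-up tool names and latencies (seconds).
TASKS = {
    "trip": [("search_flights", 0.2), ("book_flight", 0.1)],
    "report": [("query_db", 0.3)],
}


async def sequential() -> float:
    """Finish each task completely before starting the next one."""
    start = time.perf_counter()
    for task_id, tools in TASKS.items():
        await run_task(task_id, tools)
    return time.perf_counter() - start


async def interleaved() -> float:
    """Work on other tasks while one task is waiting for a tool response."""
    start = time.perf_counter()
    await asyncio.gather(*(run_task(t, tools) for t, tools in TASKS.items()))
    return time.perf_counter() - start


if __name__ == "__main__":
    seq = asyncio.run(sequential())
    par = asyncio.run(interleaved())
    print(f"sequential: {seq:.2f}s, interleaved: {par:.2f}s")
```

Under these assumed latencies, the sequential schedule waits for roughly the sum of all tool delays, while the interleaved schedule is bounded by the slowest single task. That gap is the kind of idle-time utilization an efficiency-oriented metric would reward.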