AsyncTool: Evaluating Asynchronous Function Calling Capability in Multi-Task Scenarios

Abstract

Large language model (LLM)-based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing evaluations often overlook the temporal dimension of tool use, especially the impact of tool response latency, and are usually limited to single-task settings. In real-world applications, multiple tasks often need to be executed concurrently, and overall efficiency depends on whether an agent can use idle time while waiting for tool responses. We refer to this capability as asynchronous tool calling. To evaluate it, we propose AsyncTool, a benchmark for assessing LLM-based agents in interactive multi-task tool-use environments with delayed tool feedback. AsyncTool presents multiple heterogeneous tasks simultaneously and simulates realistic tool response latency during execution. Using a hybrid data evolution strategy, we construct a diverse asynchronous multitasking dataset that covers multiple scenarios and tool-use patterns. We evaluate models at the step, sub-task, and task levels, and introduce efficiency-oriented metrics to measure task coordination and completion efficiency. Extensive experiments show that delayed tool feedback poses substantial challenges to current agents and leads to clear performance degradation. Models that better coordinate task switching, dependency tracking, and state maintenance achieve stronger performance on AsyncTool. Our analysis identifies key failure modes of current tool-using agents and provides practical insights for designing future systems with stronger temporal reasoning and coordination capabilities.
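To make the notion of asynchronous tool calling concrete, the following is a minimal sketch of the idle-time effect described above. It is not the AsyncTool environment or its metrics. The tool names, latencies, and helper functions (`call_tool`, `run_sequential`, `run_concurrent`, `timed`) are hypothetical and chosen only for illustration. It compares an agent that blocks on each tool response with one that issues all pending calls and collects the responses as they arrive.

```python
import asyncio
import time

# Hypothetical tool: the name and latency are illustrative assumptions,
# not part of the AsyncTool benchmark.
async def call_tool(name: str, latency: float) -> str:
    await asyncio.sleep(latency)  # simulated tool response delay
    return f"{name}:done"

async def run_sequential(calls):
    # Waits for each tool response before issuing the next call,
    # so the waiting time of every call adds up.
    return [await call_tool(name, latency) for name, latency in calls]

async def run_concurrent(calls):
    # Issues every call up front; the waits overlap, so total time is
    # roughly the longest single latency. Results keep the input order.
    return await asyncio.gather(*(call_tool(n, l) for n, l in calls))

def timed(coro_fn, calls):
    # Runs one strategy to completion and reports wall-clock time.
    start = time.perf_counter()
    results = asyncio.run(coro_fn(calls))
    return list(results), time.perf_counter() - start

# Three independent tasks, each waiting on one (assumed) tool.
calls = [("weather", 0.2), ("flights", 0.3), ("hotels", 0.1)]
seq_results, seq_time = timed(run_sequential, calls)
con_results, con_time = timed(run_concurrent, calls)
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

Under these assumptions both strategies return the same results, but the sequential run takes about the sum of the latencies while the concurrent run takes about the longest one. The sketch covers only independent calls; the dependency tracking and state maintenance that the abstract highlights are not modeled here.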