VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications

Abstract

As LLM-based agents are increasingly deployed in real-life scenarios, existing benchmarks fail to capture the inherent complexity of these settings: handling extensive information, leveraging diverse resources, and managing dynamic user interactions. To address this gap, we introduce VitaBench, a challenging benchmark that evaluates agents on versatile interactive tasks grounded in real-world settings. Drawing on daily applications in food delivery, in-store consumption, and online travel services, VitaBench presents agents with the most complex life-serving simulation environment to date, comprising 66 tools. Through a framework that eliminates domain-specific policies, we enable flexible composition of these scenarios and tools, yielding 100 cross-scenario tasks (main results) and 300 single-scenario tasks. Each task is derived from multiple real user requests and requires agents to reason across temporal and spatial dimensions, use complex tool sets, proactively clarify ambiguous instructions, and track shifting user intent throughout multi-turn conversations. Moreover, we propose a rubric-based sliding window evaluator that enables robust assessment of diverse solution pathways in complex environments and stochastic interactions. Our comprehensive evaluation reveals that even the most advanced models achieve only a 30% success rate on cross-scenario tasks and a success rate below 50% on the other tasks. Overall, we believe VitaBench will serve as a valuable resource for advancing the development of AI agents in practical real-world applications. The code, dataset, and leaderboard are available at https://vitabench.github.io/.
