MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

Abstract

Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic Information Retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP-ML/MobilityBench.
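The abstract names a deterministic API-replay sandbox but does not describe how it is built. As a rough illustration of the general record-and-replay idea only, the following minimal Python sketch serves pre-recorded tool responses keyed by a canonical hash of the request. The class name `ReplaySandbox`, the tool name `route_plan`, and all parameter and response fields are hypothetical assumptions for this sketch and are not taken from the MobilityBench toolkit.

```python
import hashlib
import json


class ReplaySandbox:
    """Illustrative sketch: serve recorded tool responses instead of a live map service."""

    def __init__(self, recordings):
        # recordings: dict mapping a request key to its recorded response
        self._recordings = dict(recordings)

    @staticmethod
    def request_key(tool, params):
        # Canonical JSON (sorted keys) so argument order does not change the key.
        payload = json.dumps({"tool": tool, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, tool, params):
        key = self.request_key(tool, params)
        if key not in self._recordings:
            # Fail loudly rather than fall back to a live, non-deterministic call.
            raise KeyError(f"no recorded response for {tool} with {params}")
        return self._recordings[key]


# Hypothetical recording; the tool name and fields are illustrative only.
recorded = {
    ReplaySandbox.request_key(
        "route_plan", {"origin": "A", "destination": "B", "mode": "driving"}
    ): {"distance_m": 5200, "duration_s": 780},
}
sandbox = ReplaySandbox(recorded)
```

Under these assumptions, the same request always yields the same response regardless of when it is issued, which is the property that makes end-to-end runs reproducible. Raising an error on an unrecorded request, rather than silently querying a live service, keeps that guarantee intact.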
