MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
February 26, 2026
Authors: Zhiheng Song, Jingshuai Zhang, Chuan Qin, Chao Wang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu, Hengshu Zhu
cs.AI
Abstract
Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision-making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic Information Retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP-ML/MobilityBench.