MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

Abstract

Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic Information Retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP-ML/MobilityBench.
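
To make the idea of a deterministic API-replay sandbox concrete, the following is a minimal sketch under stated assumptions, not the MobilityBench implementation. The class name `ReplaySandbox`, the tool name `"route"`, and the parameter and response fields are all hypothetical; the abstract does not describe the actual interface. The sketch only illustrates the general technique: tool calls are keyed by a canonical hash of their arguments and answered from stored recordings, so repeated evaluations see identical responses.

```python
import hashlib
import json


class ReplaySandbox:
    """Answer tool calls from stored recordings instead of a live map service.

    Hypothetical illustration only; names and fields are assumptions.
    """

    def __init__(self, recordings=None):
        # Maps a request key to the response captured for that request.
        self._recordings = dict(recordings or {})

    @staticmethod
    def _key(tool, params):
        # Canonical JSON (sorted keys) makes the key independent of argument order.
        canonical = json.dumps(
            {"tool": tool, "params": params}, sort_keys=True, ensure_ascii=False
        )
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def record(self, tool, params, response):
        # Store a response captured once from the real service.
        self._recordings[self._key(tool, params)] = response

    def call(self, tool, params):
        # Replay the stored response; fail loudly rather than hit a live API.
        key = self._key(tool, params)
        if key not in self._recordings:
            raise KeyError(f"no recorded response for tool {tool!r} with {params!r}")
        return self._recordings[key]


sandbox = ReplaySandbox()
sandbox.record(
    "route",
    {"origin": "A", "destination": "B", "mode": "driving"},
    {"duration_s": 900, "distance_m": 7200},
)

# The same logical request, with arguments in a different order, replays identically.
first = sandbox.call("route", {"mode": "driving", "destination": "B", "origin": "A"})
second = sandbox.call("route", {"origin": "A", "destination": "B", "mode": "driving"})
```

Raising an error on an unrecorded request, rather than falling back to a live call, is one design choice that keeps runs reproducible; a real sandbox might handle misses differently.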