MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

Abstract

Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic Information Retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP-ML/MobilityBench.
