TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation

Abstract

Public transit route planning traditionally depends on structured map infrastructure and complex routing engines, and no existing dataset supports training models to bypass this dependency. We present TransitLM, a large-scale dataset of over 13 million transit route planning records from four Chinese cities, covering 120,845 stations and 13,666 lines. It is released as a continual pre-training corpus together with benchmark data for three evaluation tasks with complementary metrics. Experiments show that an LLM trained on TransitLM produces structurally valid routes with high accuracy and implicitly grounds arbitrary GPS coordinates to appropriate stations without any explicit mapping. These results demonstrate that transit route planning can be learned entirely from data, enabling end-to-end, map-free route generation directly from origin-destination information. The dataset and benchmark are available at https://huggingface.co/datasets/GD-ML/TransitLM, with evaluation code at https://github.com/HotTricker/TransitLM.