TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation

Abstract

Public transit route planning traditionally depends on structured map infrastructure and complex routing engines, and no existing dataset supports training models that bypass this dependency. We present TransitLM, a large-scale dataset of over 13 million transit route planning records from four Chinese cities, covering 120,845 stations and 13,666 lines. It is released as a continual pre-training corpus, together with benchmark data comprising three evaluation tasks with complementary metrics. Experiments show that a large language model trained on TransitLM generates structurally valid routes with high accuracy and implicitly grounds arbitrary GPS coordinates to appropriate stations without any explicit mapping. These results demonstrate that transit route planning can be learned entirely from data, enabling end-to-end, map-free route generation directly from origin-destination information. The dataset and benchmark are available at https://huggingface.co/datasets/GD-ML/TransitLM, with evaluation code at https://github.com/HotTricker/TransitLM.
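To make the notion of grounding GPS coordinates to stations more concrete, the following is a minimal sketch of one way such grounding could be checked: compare the station a model selects for a query coordinate against the geographically nearest station. The station names and coordinates, and the functions haversine_m, nearest_station and grounding_error_m, are hypothetical illustrations and are not taken from TransitLM or its evaluation code. The benchmark's own metrics may differ, and an appropriate station is not necessarily the nearest one.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_station(lat, lon, stations):
    """Return (name, distance_m) of the closest station in a {name: (lat, lon)} dict."""
    name = min(stations, key=lambda s: haversine_m(lat, lon, *stations[s]))
    return name, haversine_m(lat, lon, *stations[name])


def grounding_error_m(lat, lon, predicted_station, stations):
    """Extra distance (metres) of the predicted station over the nearest one."""
    _, best = nearest_station(lat, lon, stations)
    predicted = haversine_m(lat, lon, *stations[predicted_station])
    return predicted - best


# Hypothetical stations for illustration only (assumed, not from the dataset).
STATIONS = {
    "Station A": (39.9000, 116.4000),
    "Station B": (39.9100, 116.4000),
    "Station C": (39.9000, 116.4200),
}

if __name__ == "__main__":
    query = (39.9010, 116.4001)  # an arbitrary origin coordinate
    print(nearest_station(*query, STATIONS))
    # Suppose a model's generated route boards at "Station B".
    print(grounding_error_m(*query, "Station B", STATIONS))
```

A grounding error of zero means the model chose the nearest station. A small positive value can still be a reasonable choice, for example when a slightly farther station lies on a more direct line.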