TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation

Abstract

Public transit route planning traditionally depends on structured map infrastructure and complex routing engines, and no existing dataset supports training models to bypass this dependency. We present TransitLM, a large-scale dataset of over 13 million transit route planning records from four Chinese cities, covering 120,845 stations and 13,666 lines. It is released both as a continual pre-training corpus and as benchmark data for three evaluation tasks with complementary metrics. Experiments show that an LLM trained on TransitLM generates structurally valid routes with high accuracy and implicitly grounds arbitrary GPS coordinates to appropriate stations without any explicit mapping. These results demonstrate that transit route planning can be learned entirely from data, enabling end-to-end, map-free route generation directly from origin-destination information. The dataset and benchmark are available at https://huggingface.co/datasets/GD-ML/TransitLM, and the evaluation code is available at https://github.com/HotTricker/TransitLM.