NaviDriveVLM: Decoupling High-Level Reasoning and Motion Planning for Autonomous Driving

Abstract

Vision-language models (VLMs) have emerged as a promising direction for end-to-end autonomous driving (AD) by jointly modeling visual observations, driving context, and language-based reasoning. However, existing VLM-based systems face a trade-off between high-level reasoning and motion planning: large models offer strong semantic understanding but are costly to adapt for precise control, whereas small VLMs can be fine-tuned efficiently but often exhibit weaker reasoning. We propose NaviDriveVLM, a decoupled framework that separates reasoning from action generation using a large-scale Navigator and a lightweight, trainable Driver. This design preserves reasoning ability, reduces training cost, and provides an explicit, interpretable intermediate representation for downstream planning. Experiments on the nuScenes benchmark show that NaviDriveVLM outperforms large VLM baselines in end-to-end motion planning.
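To make the decoupling concrete, the following is a minimal sketch of how a Navigator/Driver split could be wired together. It is not the authors' implementation: `Scene`, `toy_navigator`, `toy_driver`, and `plan` are hypothetical names, the Navigator is assumed to be used as-is while only the Driver would be trained, and the six-step, 0.5 s horizon and ego-frame waypoint format are illustrative assumptions. Both components are replaced by trivial rules so that the data flow (scene, then text instruction, then waypoints) is visible.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (x, y) in metres in the ego frame; an assumed convention for this sketch.
Waypoint = Tuple[float, float]


@dataclass
class Scene:
    description: str   # stand-in for camera observations and driving context
    speed_mps: float   # current ego speed in metres per second


def toy_navigator(scene: Scene) -> str:
    """Hypothetical Navigator: stands in for a large VLM that reasons about the scene.

    It returns a short text instruction, which plays the role of the explicit,
    interpretable intermediate representation.
    """
    if "pedestrian" in scene.description:
        return "stop"
    return "keep lane"


def toy_driver(scene: Scene, instruction: str,
               horizon: int = 6, dt: float = 0.5) -> List[Waypoint]:
    """Hypothetical Driver: stands in for a small trainable planner.

    A fixed rule replaces the learned model: hold the current speed along a
    straight line unless the instruction says to stop.
    """
    speed = 0.0 if instruction == "stop" else scene.speed_mps
    return [(speed * dt * (k + 1), 0.0) for k in range(horizon)]


def plan(scene: Scene,
         navigator: Callable[[Scene], str],
         driver: Callable[[Scene, str], List[Waypoint]]):
    """Run reasoning and action generation as two separate stages."""
    instruction = navigator(scene)          # high-level reasoning
    trajectory = driver(scene, instruction)  # motion planning
    return instruction, trajectory
```

Calling `plan(Scene("clear road", 4.0), toy_navigator, toy_driver)` returns the instruction together with the waypoints, so the intermediate text can be inspected or logged independently of the trajectory. Because the two stages only share that text interface, either one could be swapped without touching the other.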
