ChatPaper.ai


NaviDriveVLM: Decoupling High-Level Reasoning and Motion Planning for Autonomous Driving

March 9, 2026
Authors: Ximeng Tao, Pardis Taghavi, Dimitar Filev, Reza Langari, Gaurav Pandey
cs.AI

Abstract

Vision-language models (VLMs) have emerged as a promising direction for end-to-end autonomous driving (AD) by jointly modeling visual observations, driving context, and language-based reasoning. However, existing VLM-based systems face a trade-off between high-level reasoning and motion planning: large models offer strong semantic understanding but are costly to adapt for precise control, whereas small VLM models can be fine-tuned efficiently but often exhibit weaker reasoning. We propose NaviDriveVLM, a decoupled framework that separates reasoning from action generation using a large-scale Navigator and a lightweight trainable Driver. This design preserves reasoning ability, reduces training cost, and provides an explicit interpretable intermediate representation for downstream planning. Experiments on the nuScenes benchmark show that NaviDriveVLM outperforms large VLM baselines in end-to-end motion planning.