NaviDriveVLM: Decoupling High-Level Reasoning and Motion Planning for Autonomous Driving

Abstract

Vision-language models (VLMs) have emerged as a promising direction for end-to-end autonomous driving (AD) because they jointly model visual observations, driving context, and language-based reasoning. However, existing VLM-based systems face a trade-off between high-level reasoning and motion planning: large models offer strong semantic understanding but are costly to adapt for precise control, whereas small VLMs can be fine-tuned efficiently but often exhibit weaker reasoning. We propose NaviDriveVLM, a decoupled framework that separates reasoning from action generation using a large-scale Navigator and a lightweight trainable Driver. This design preserves reasoning ability, reduces training cost, and provides an explicit, interpretable intermediate representation for downstream planning. Experiments on the nuScenes benchmark show that NaviDriveVLM outperforms large VLM baselines in end-to-end motion planning.
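The decoupling described above can be illustrated with a toy sketch. This is not the authors' implementation: the `Guidance` format, the rule-based `navigator` stub, the `Driver` class with its single `lateral_gain` parameter, and the waypoint formula are all hypothetical stand-ins, chosen only to show a fixed reasoning stage handing an explicit, readable intermediate representation to a small adjustable planning stage.

```python
from dataclasses import dataclass
from typing import List, Tuple

Waypoint = Tuple[float, float]  # (forward x, lateral y) in metres, ego frame


@dataclass(frozen=True)
class Guidance:
    """Hypothetical intermediate representation passed from Navigator to Driver."""
    maneuver: str        # e.g. "keep_lane", "turn_left", "stop"
    target_speed: float  # metres per second


def navigator(scene_description: str) -> Guidance:
    """Stand-in for the large reasoning model: a fixed, rule-based stub."""
    if "red light" in scene_description:
        return Guidance("stop", 0.0)
    if "left turn" in scene_description:
        return Guidance("turn_left", 4.0)
    return Guidance("keep_lane", 8.0)


@dataclass
class Driver:
    """Stand-in for the lightweight planner; lateral_gain plays the role of
    the only adjustable (trainable) parameter in this toy."""
    lateral_gain: float = 0.5

    def plan(self, guidance: Guidance, horizon: int = 3, dt: float = 0.5) -> List[Waypoint]:
        # Positive sign bends the path left, negative right, zero keeps it straight.
        sign = {"turn_left": 1.0, "turn_right": -1.0}.get(guidance.maneuver, 0.0)
        waypoints = []
        for k in range(1, horizon + 1):
            t = k * dt
            x = guidance.target_speed * t          # constant-speed forward motion
            y = sign * self.lateral_gain * t * t   # quadratic lateral offset
            waypoints.append((x, y))
        return waypoints


def drive(scene_description: str, driver: Driver) -> Tuple[Guidance, List[Waypoint]]:
    """Run reasoning first, then planning; the guidance stays inspectable."""
    guidance = navigator(scene_description)
    return guidance, driver.plan(guidance)


if __name__ == "__main__":
    guidance, path = drive("left turn ahead at the intersection", Driver())
    print(guidance)
    print(path)
```

Because `navigator` is never modified, only the `Driver` would need to be adapted in such a setup, and the `Guidance` object can be logged or inspected to see why a given path was produced.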
