Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving

Abstract

End-to-End (E2E) solutions have become a mainstream approach for autonomous driving systems, and Vision-Language-Action (VLA) models represent a new paradigm that leverages pre-trained multimodal knowledge from Vision-Language Models (VLMs) to interpret and interact with complex real-world environments. However, these methods remain constrained by imitation learning, which struggles to encode physical rules during training. Existing approaches often rely on complex rule-based post-refinement, employ reinforcement learning that remains largely limited to simulation, or use diffusion guidance that requires computationally expensive gradient calculations. To address these challenges, we introduce ReflectDrive, a learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. We first discretize the two-dimensional driving space to construct an action codebook, which allows pre-trained Diffusion Language Models to be fine-tuned for planning tasks. Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient computation. Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors. On top of this, we apply local search to identify unsafe tokens and determine feasible solutions, which then serve as safe anchors for inpainting-based regeneration. Evaluated on the NAVSIM benchmark, ReflectDrive demonstrates significant advantages in safety-critical trajectory generation, offering a scalable and reliable solution for autonomous driving systems.
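
As a rough illustration of two ideas named in the abstract, the sketch below quantizes 2D waypoints into discrete tokens and then runs a simple local search that replaces unsafe tokens with nearby safe ones. Everything specific here is an assumption made for illustration, not the paper's design: the uniform 1 m grid, the circular obstacle model, and the function names (`encode`, `decode`, `is_safe`, `nearest_safe_token`, `find_anchors`) are all hypothetical. The inpainting step itself needs a trained discrete diffusion model and is not shown.

```python
import math
from typing import Dict, List, Optional, Tuple

# Hypothetical grid (not from the source): a 100 m x 100 m region around the
# ego vehicle, split into BINS x BINS cells of 1 m each.
LO, HI, BINS = -50.0, 50.0, 100
CELL = (HI - LO) / BINS

Obstacle = Tuple[float, float, float]  # (center_x, center_y, radius) in metres


def _bin(value: float) -> int:
    """Map one coordinate to a bin index, clamping values outside the grid."""
    return min(max(int((value - LO) / CELL), 0), BINS - 1)


def encode(x: float, y: float) -> int:
    """Quantize a 2D waypoint into a single token id (row-major cell index)."""
    return _bin(x) * BINS + _bin(y)


def decode(token: int) -> Tuple[float, float]:
    """Return the centre of the grid cell that a token id refers to."""
    ix, iy = divmod(token, BINS)
    return (LO + (ix + 0.5) * CELL, LO + (iy + 0.5) * CELL)


def is_safe(token: int, obstacles: List[Obstacle]) -> bool:
    """Toy safety check: the cell centre must lie outside every obstacle circle."""
    x, y = decode(token)
    return all(math.hypot(x - ox, y - oy) > r for ox, oy, r in obstacles)


def nearest_safe_token(token: int, obstacles: List[Obstacle],
                       radius: int = 3) -> Optional[int]:
    """Local search: scan cells within `radius` cells and return the closest safe one."""
    ix, iy = divmod(token, BINS)
    best, best_d2 = None, None
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            jx, jy = ix + dx, iy + dy
            if not (0 <= jx < BINS and 0 <= jy < BINS):
                continue  # candidate falls outside the grid
            candidate = jx * BINS + jy
            d2 = dx * dx + dy * dy
            if is_safe(candidate, obstacles) and (best_d2 is None or d2 < best_d2):
                best, best_d2 = candidate, d2
    return best


def find_anchors(tokens: List[int], obstacles: List[Obstacle]) -> Dict[int, int]:
    """Map each unsafe trajectory position to a nearby safe replacement token."""
    anchors: Dict[int, int] = {}
    for i, tok in enumerate(tokens):
        if not is_safe(tok, obstacles):
            fix = nearest_safe_token(tok, obstacles)
            if fix is not None:
                anchors[i] = fix
    return anchors


# Usage: a three-waypoint trajectory whose middle waypoint sits inside an obstacle.
trajectory = [encode(5.0, 0.5), encode(10.5, 0.5), encode(15.0, 0.5)]
anchors = find_anchors(trajectory, [(10.5, 0.5, 2.0)])
```

In this reading, the positions in `anchors` would be held fixed while a discrete diffusion model re-samples the remaining tokens around them. The search uses only safety checks on candidate cells, so the correction step involves no gradient computation.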