PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving

Abstract

End-to-end autonomous driving models show promising results, but their practical deployment is often hindered by large model sizes, reliance on expensive LiDAR sensors, and computationally intensive bird's-eye-view (BEV) feature representations. These constraints limit scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels), a novel and efficient end-to-end driving architecture that operates on camera data alone, with no explicit BEV representation and no LiDAR. PRIX couples a visual feature extractor with a generative planning head to predict safe trajectories directly from raw pixel inputs. A core component of the architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to enhance multi-level visual features for more robust planning. Comprehensive experiments show that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger, multimodal diffusion planners while being significantly more efficient in inference speed and model size, which makes it a practical solution for real-world deployment. Our work is open-source, and the code will be available at https://maxiuw.github.io/prix.
