PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving

Abstract

While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors, and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels), a novel and efficient end-to-end driving architecture that operates on camera data alone, with no explicit BEV representation and no need for LiDAR. PRIX couples a visual feature extractor with a generative planning head to predict safe trajectories directly from raw pixel inputs. A core component of the architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to enhance multi-level visual features for more robust planning. Comprehensive experiments show that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger, multimodal diffusion planners while being significantly more efficient in inference speed and model size, which makes it a practical solution for real-world deployment. Our work is open-source and the code will be available at https://maxiuw.github.io/prix.
