DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
December 14, 2025
Authors: Zhe Liu, Runhui Huang, Rui Yang, Siming Yan, Zining Wang, Lu Hou, Di Lin, Xiang Bai, Hengshuang Zhao
cs.AI
Abstract
Although multi-modal large language models (MLLMs) have shown strong capabilities across diverse domains, their application to generating fine-grained 3D perception and prediction outputs in autonomous driving remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework and is also compatible with vision-action (VA) models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture. We further develop a data engine to generate text-occupancy and text-flow QA pairs for 4D spatial understanding. Remarkably, with only a 0.5B Qwen2.5 model as the MLLM backbone, DrivePI, as a single unified model, matches or exceeds both existing VLA models and specialized VA models. Specifically, compared to VLA models, DrivePI outperforms OpenDriveVLA-7B by 2.5% mean accuracy on nuScenes-QA and reduces the collision rate by 70% relative to ORION (from 0.37% to 0.11%) on nuScenes. Against specialized VA models, DrivePI surpasses FB-OCC by 10.3 RayIoU for 3D occupancy on OpenOcc, reduces the mAVE for occupancy flow on OpenOcc from 0.591 to 0.509, and achieves a 32% lower L2 error than VAD (from 0.72 m to 0.49 m) for planning on nuScenes. Code will be available at https://github.com/happinesslz/DrivePI.
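To make the described architecture concrete, the sketch below is a minimal, illustrative PyTorch forward pass, not the authors' released implementation: fused point-cloud, multi-view image, and language tokens pass through a small transformer standing in for the 0.5B Qwen2.5 backbone, and parallel heads decode 3D occupancy, occupancy flow, and planning waypoints. All module names, feature dimensions, and head designs here are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class DrivePISketch(nn.Module):
    """Toy stand-in for the unified multi-modal design in the abstract (illustrative only)."""

    def __init__(self, d_model=896, occ_classes=17, num_waypoints=6):
        super().__init__()
        self.num_waypoints = num_waypoints
        # Placeholder modality projections; a real system would use a LiDAR
        # voxel/pillar encoder and a multi-view image backbone upstream.
        self.lidar_proj = nn.Linear(128, d_model)
        self.image_proj = nn.Linear(256, d_model)
        # Stand-in for the MLLM backbone (the paper uses Qwen2.5-0.5B; 896 is that model's hidden size).
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # Parallel task heads optimized end to end.
        self.occ_head = nn.Linear(d_model, occ_classes)          # 3D occupancy logits
        self.flow_head = nn.Linear(d_model, 3)                   # occupancy flow vectors
        self.plan_head = nn.Linear(d_model, num_waypoints * 2)   # BEV planning waypoints

    def forward(self, lidar_feats, image_feats, text_embeds):
        # lidar_feats: (B, N_pts, 128), image_feats: (B, N_img, 256),
        # text_embeds: (B, N_txt, d_model) from the language tokenizer/embedder.
        tokens = torch.cat(
            [self.lidar_proj(lidar_feats), self.image_proj(image_feats), text_embeds],
            dim=1,
        )
        hidden = self.backbone(tokens)
        scene = hidden.mean(dim=1)  # pooled scene token used for planning
        return {
            "occupancy_logits": self.occ_head(hidden),
            "flow": self.flow_head(hidden),
            "waypoints": self.plan_head(scene).view(-1, self.num_waypoints, 2),
        }


if __name__ == "__main__":
    model = DrivePISketch()
    out = model(
        torch.randn(1, 1024, 128),     # LiDAR point/voxel features
        torch.randn(1, 6 * 196, 256),  # patch tokens from 6 camera views
        torch.randn(1, 32, 896),       # embedded language instruction
    )
    print({k: tuple(v.shape) for k, v in out.items()})
```

In this sketch, all three task heads read from the same fused token sequence, which mirrors the abstract's claim of parallel perception, prediction, and planning under a single end-to-end optimized model; how DrivePI actually structures its queries and decoders is detailed in the paper itself.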