UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving
April 2, 2026
Authors: Yongkang Li, Lijun Zhou, Sixu Yan, Bencheng Liao, Tianyi Yan, Kaixin Xiong, Long Chen, Hongwei Xie, Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Haiyang Sun, Xinggang Wang
cs.AI
Abstract
Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and models have been released at https://github.com/xiaomi-research/unidrivevla.
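The abstract's core mechanism, expert decoupling with masked joint attention, can be illustrated with a minimal sketch: each expert owns its own projection parameters, attention runs jointly over the concatenated token streams, and a block-level mask controls which expert's tokens may attend to which. The mask pattern below (planning reads all experts, the others stay local) is a hypothetical choice for illustration, not the paper's actual masking scheme, and all names here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden size (illustrative)

# Per-expert QKV projections: the experts share the attention operation
# but NOT the parameters (this is the decoupling).
experts = {name: {k: rng.standard_normal((d, d)) / np.sqrt(d)
                  for k in ("Wq", "Wk", "Wv")}
           for name in ("understanding", "perception", "planning")}

def masked_joint_attention(blocks, mask_fn):
    """blocks: dict expert_name -> (n_i, d) token array.
    Each expert projects its own tokens with its own weights; attention
    then runs jointly over the concatenation, restricted by a
    block-level mask between expert groups."""
    names, qs, ks, vs, owner = list(blocks), [], [], [], []
    for name in names:
        x, p = blocks[name], experts[name]
        qs.append(x @ p["Wq"])
        ks.append(x @ p["Wk"])
        vs.append(x @ p["Wv"])
        owner += [name] * len(x)
    Q, K, V = map(np.concatenate, (qs, ks, vs))
    scores = Q @ K.T / np.sqrt(d)
    # Block mask: mask_fn decides whether a query token owned by one
    # expert may attend to a key token owned by another.
    allowed = np.array([[mask_fn(qo, ko) for ko in owner] for qo in owner])
    scores = np.where(allowed, scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Hypothetical mask: planning tokens may read every expert's tokens;
# understanding and perception tokens attend only within their own group.
def mask_fn(q_owner, k_owner):
    return q_owner == "planning" or q_owner == k_owner

blocks = {n: rng.standard_normal((3, d)) for n in experts}
out = masked_joint_attention(blocks, mask_fn)
print(out.shape)  # 9 tokens total, hidden size 16
```

Under this mask, perturbing the planning tokens leaves the understanding and perception outputs untouched, which is the point of the decoupling: gradients from one expert's objective cannot flow through another expert's attention pattern unless the mask permits it.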