UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving

Abstract

Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models to driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models (VLMs) yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of the VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and model have been released at https://github.com/xiaomi-research/unidrivevla.
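
The idea of three experts coordinated through masked joint attention can be pictured with a small NumPy sketch. This is an illustration under stated assumptions, not the released implementation: the abstract does not specify the mask, so the visibility pattern in ALLOWED is hypothetical, as are all names, the single-head layout, and the toy dimensions. Each token group gets its own query, key, and value projections (the decoupled, per-expert parameters), while one attention operation runs over the joint token sequence and a block mask restricts which groups can read from which.

```python
import numpy as np

GROUPS = ("understanding", "perception", "action")

# Hypothetical visibility pattern (an assumption, not taken from the paper):
# ALLOWED[q] lists the groups a query token of group q may attend to.
ALLOWED = {
    "understanding": {"understanding"},
    "perception": {"understanding", "perception"},
    "action": {"understanding", "perception", "action"},
}


def build_mask(group_ids):
    """Return a boolean (n, n) mask; [i, j] is True if token i may attend to token j."""
    n = len(group_ids)
    mask = np.zeros((n, n), dtype=bool)
    for i, gq in enumerate(group_ids):
        for j, gk in enumerate(group_ids):
            mask[i, j] = GROUPS[gk] in ALLOWED[GROUPS[gq]]
    return mask


def masked_joint_attention(x, group_ids, params, mask):
    """Single-head attention over all tokens with per-group projections.

    x: (n, d) token features.
    group_ids: (n,) integer array indexing GROUPS.
    params: dict mapping group index to {'q', 'k', 'v'} weight matrices of shape (d, d).
    mask: boolean (n, n) visibility mask from build_mask.
    """
    n, d = x.shape
    q = np.zeros((n, d))
    k = np.zeros((n, d))
    v = np.zeros((n, d))
    # Decoupled parameters: each group's tokens use that group's own weights.
    for g, p in params.items():
        idx = group_ids == g
        q[idx] = x[idx] @ p["q"]
        k[idx] = x[idx] @ p["k"]
        v[idx] = x[idx] @ p["v"]
    # Joint attention: one score matrix over the whole sequence, then masking.
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights


# Toy usage: 3 understanding tokens, 2 perception tokens, 2 action tokens.
rng = np.random.default_rng(0)
d = 8
group_ids = np.array([0, 0, 0, 1, 1, 2, 2])
x = rng.normal(size=(len(group_ids), d))
params = {
    g: {name: rng.normal(size=(d, d)) / np.sqrt(d) for name in ("q", "k", "v")}
    for g in range(len(GROUPS))
}
mask = build_mask(group_ids)
out, weights = masked_joint_attention(x, group_ids, params, mask)
```

With this assumed mask, the understanding tokens never receive information from perception or action tokens, which is one plausible way to keep spatial features from disturbing semantic reasoning while still letting the action tokens read from everything.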
