UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving

Abstract

Vision-Language-Action (VLA) models have recently emerged in autonomous driving, promising to leverage rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models to driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models (VLMs) yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and model have been released at https://github.com/xiaomi-research/unidrivevla.
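
The expert decoupling and masked joint attention described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the released implementation: the group names, the `VISIBLE` rule (understanding tokens see only themselves, perception tokens also see understanding tokens, action tokens see everything), the single attention head, and all helper names are assumptions made for this example. The abstract does not specify the actual mask.

```python
import numpy as np

# Hypothetical token groups, in sequence order. The visibility rule is an
# assumption for illustration only; the abstract does not state the real mask.
GROUPS = ("understanding", "perception", "action")
VISIBLE = {
    "understanding": {"understanding"},
    "perception": {"understanding", "perception"},
    "action": {"understanding", "perception", "action"},
}


def build_joint_mask(lengths):
    """Boolean (T, T) mask; mask[i, j] is True if token i may attend to token j."""
    labels = [g for g in GROUPS for _ in range(lengths[g])]
    total = len(labels)
    mask = np.zeros((total, total), dtype=bool)
    for i, query_group in enumerate(labels):
        for j, key_group in enumerate(labels):
            mask[i, j] = key_group in VISIBLE[query_group]
    return mask


def joint_attention_layer(tokens, params, mask):
    """Per-expert Q/K/V projections, then one attention pass over all tokens."""
    # Each expert projects its own tokens with its own (separate) weights.
    q = np.concatenate([tokens[g] @ params[g]["Wq"] for g in GROUPS])
    k = np.concatenate([tokens[g] @ params[g]["Wk"] for g in GROUPS])
    v = np.concatenate([tokens[g] @ params[g]["Wv"] for g in GROUPS])

    # Single-head scaled dot-product attention; disallowed pairs get -inf.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Toy usage with random inputs (sizes are arbitrary).
rng = np.random.default_rng(0)
lengths = {"understanding": 4, "perception": 3, "action": 2}
dim = 8
tokens = {g: rng.normal(size=(n, dim)) for g, n in lengths.items()}
params = {
    g: {name: rng.normal(size=(dim, dim)) / np.sqrt(dim) for name in ("Wq", "Wk", "Wv")}
    for g in GROUPS
}
mask = build_joint_mask(lengths)
out, weights = joint_attention_layer(tokens, params, mask)
```

Under this assumed mask, the outputs at understanding positions do not depend on perception or action tokens, which is one way such a design could keep semantic reasoning insulated from perception-specific features while still letting the action expert read from both.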
