
UniUGP: Unifying Understanding, Generation, and Planning for End-to-end Autonomous Driving

December 10, 2025
Authors: Hao Lu, Ziyang Liu, Guangfeng Jiang, Yuanfei Luo, Sheng Chen, Yangang Zhang, Ying-Cong Chen
cs.AI

Abstract

Autonomous driving (AD) systems struggle in long-tail scenarios due to limited world knowledge and weak modeling of visual dynamics. Existing vision-language-action (VLA) methods cannot leverage unlabeled videos for visual causal learning, while world-model-based methods lack the reasoning capabilities of large language models. In this paper, we construct multiple specialized datasets that provide reasoning and planning annotations for complex scenarios. We then propose UniUGP, a unified Understanding-Generation-Planning framework that synergizes scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. By integrating pre-trained VLMs with video generation models, UniUGP exploits both visual dynamics and semantic reasoning to enhance planning. Taking multi-frame observations and a language instruction as input, it produces interpretable chain-of-thought reasoning, physically consistent trajectories, and coherent future videos. A four-stage training strategy progressively builds these capabilities across multiple existing AD datasets together with the proposed specialized ones. Experiments demonstrate state-of-the-art performance in perception, reasoning, and decision-making, with superior generalization to challenging long-tail situations.
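
The abstract fixes the model's interface: multi-frame observations plus a language instruction in, and chain-of-thought text, a trajectory, and future video frames out, with a pre-trained VLM expert and a video-generation expert combined inside one network. Below is a minimal PyTorch sketch of that contract. Everything here is an illustrative assumption: the class name UniUGPSketch, the tiny transformer stand-ins for the two experts, the output shapes, and the simple additive fusion are placeholders, not the authors' mixture-of-experts design or training recipe.

```python
# Hypothetical sketch of the UniUGP input/output contract described in the
# abstract; all names, dimensions, and the fusion scheme are assumptions.
import torch
import torch.nn as nn


class UniUGPSketch(nn.Module):
    """Unified understanding-generation-planning interface (illustrative)."""

    def __init__(self, d_model: int = 512, vocab: int = 32000, horizon: int = 6):
        super().__init__()
        self.horizon = horizon
        # Tiny transformer stand-ins for the two pre-trained experts.
        self.vlm_expert = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.video_expert = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Heads for the three outputs named in the abstract.
        self.reasoning_head = nn.Linear(d_model, vocab)          # CoT token logits
        self.trajectory_head = nn.Linear(d_model, horizon * 2)   # (x, y) waypoints
        self.video_head = nn.Linear(d_model, 3 * 32 * 32)        # toy future frame

    def forward(self, obs_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # obs_tokens:  (B, T_obs, d) multi-frame observation embeddings
        # text_tokens: (B, T_txt, d) language-instruction embeddings
        fused = torch.cat([obs_tokens, text_tokens], dim=1)
        h_sem = self.vlm_expert(fused)           # semantic-reasoning pathway
        h_dyn = self.video_expert(obs_tokens)    # visual-dynamics pathway
        pooled = h_sem.mean(dim=1) + h_dyn.mean(dim=1)  # naive fusion placeholder
        cot_logits = self.reasoning_head(h_sem)                        # (B, T, V)
        traj = self.trajectory_head(pooled).view(-1, self.horizon, 2)  # (B, H, 2)
        frame = self.video_head(h_dyn.mean(dim=1)).view(-1, 3, 32, 32)
        return cot_logits, traj, frame


# Smoke test with random embeddings.
model = UniUGPSketch()
obs = torch.randn(2, 8, 512)   # 2 samples, 8 observation tokens
txt = torch.randn(2, 4, 512)   # 4 instruction tokens
cot, traj, video = model(obs, txt)
print(cot.shape, traj.shape, video.shape)
```

In the paper's framing, the semantic pathway would be a full pre-trained VLM and the dynamics pathway a video generation backbone, trained jointly via the four-stage strategy; the sketch only pins down the shapes of the three outputs the abstract promises.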