
UniUGP: Unifying Understanding, Generation, and Planning for End-to-End Autonomous Driving

December 10, 2025
Authors: Hao Lu, Ziyang Liu, Guangfeng Jiang, Yuanfei Luo, Sheng Chen, Yangang Zhang, Ying-Cong Chen
cs.AI

Abstract

Autonomous driving (AD) systems struggle in long-tail scenarios due to limited world knowledge and weak visual dynamic modeling. Existing vision-language-action (VLA)-based methods cannot leverage unlabeled videos for visual causal learning, while world model-based methods lack reasoning capabilities from large language models. In this paper, we construct multiple specialized datasets providing reasoning and planning annotations for complex scenarios. Then, a unified Understanding-Generation-Planning framework, named UniUGP, is proposed to synergize scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. By integrating pre-trained VLMs and video generation models, UniUGP leverages visual dynamics and semantic reasoning to enhance planning performance. Taking multi-frame observations and language instructions as input, it produces interpretable chain-of-thought reasoning, physically consistent trajectories, and coherent future videos. We introduce a four-stage training strategy that progressively builds these capabilities across multiple existing AD datasets, along with the proposed specialized datasets. Experiments demonstrate state-of-the-art performance in perception, reasoning, and decision-making, with superior generalization to challenging long-tail situations.
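For illustration only, the sketch below shows one plausible way a unified understanding-generation-planning model could take multi-frame observations and a language instruction and emit the three outputs the abstract describes (chain-of-thought logits, a trajectory, and a predicted future frame) from a shared fused representation. Every module name, dimension, and the fusion scheme here is a hypothetical stand-in; the actual UniUGP architecture, its expert routing, and its pre-trained components are not specified in this abstract.

```python
# Hypothetical sketch, not the authors' implementation: a shared vision-language
# fusion backbone feeding three expert heads (reasoning, planning, generation).
import torch
import torch.nn as nn


class UniUGPSketch(nn.Module):
    """Toy unified model with placeholder encoders and three output heads."""

    def __init__(self, d_model=256, traj_steps=8, vocab=1000):
        super().__init__()
        self.traj_steps = traj_steps
        # Stand-ins for a pre-trained VLM vision tower and text embedding.
        self.frame_encoder = nn.Sequential(nn.Flatten(2), nn.LazyLinear(d_model))
        self.text_embed = nn.Embedding(vocab, d_model)
        # Shared fusion over concatenated visual and language tokens.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Expert heads: understanding (CoT tokens), planning (waypoints),
        # generation (a single predicted frame as a flat latent).
        self.reason_head = nn.Linear(d_model, vocab)
        self.plan_head = nn.Linear(d_model, traj_steps * 2)
        self.video_head = nn.Linear(d_model, 3 * 32 * 32)

    def forward(self, frames, instruction_ids):
        # frames: (B, T, C, H, W); instruction_ids: (B, L)
        vis = self.frame_encoder(frames)        # (B, T, d_model)
        txt = self.text_embed(instruction_ids)  # (B, L, d_model)
        tokens = self.fusion(torch.cat([vis, txt], dim=1))
        pooled = tokens.mean(dim=1)
        return {
            "reasoning_logits": self.reason_head(tokens),
            "trajectory": self.plan_head(pooled).view(-1, self.traj_steps, 2),
            "future_frame": self.video_head(pooled).view(-1, 3, 32, 32),
        }


if __name__ == "__main__":
    model = UniUGPSketch()
    frames = torch.randn(2, 4, 3, 32, 32)           # 4-frame toy observations
    instruction = torch.randint(0, 1000, (2, 12))   # tokenized language command
    out = model(frames, instruction)
    print({k: v.shape for k, v in out.items()})
```

In a real system the three heads would be far heavier (an autoregressive language decoder, a diffusion-style video generator, and a physically constrained trajectory decoder), and the staged training described in the abstract would decide which components are frozen or tuned at each phase; the sketch only illustrates the shared-input, multi-output structure.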