

Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving

January 29, 2026
Authors: Linhan Wang, Zichong Yang, Chen Bai, Guoxiang Zhang, Xiaotong Liu, Xiaoyin Zheng, Xiao-Xiao Long, Chang-Tien Lu, Cheng Lu
cs.AI

Abstract

End-to-end autonomous driving increasingly leverages self-supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive-JEPA, a framework that integrates Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end driving. First, we adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal-centric planner that distills diverse simulator-generated trajectories alongside human trajectories, with a momentum-aware selection mechanism to promote stable and safe behavior. When evaluated on NAVSIM, the V-JEPA representation combined with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting. The complete Drive-JEPA framework achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting a new state of the art.
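
The abstract names the mechanisms but not their form, so below is a minimal PyTorch sketch of one way proposal-centric trajectory distillation with momentum-aware selection could be set up. Everything here is a hypothetical illustration under stated assumptions, not the paper's implementation: ProposalScorer, momentum_consistency, the PDMS-weighted soft targets, and all shapes and hyperparameters are invented for exposition, and scene_feat merely stands in for a pooled V-JEPA scene embedding.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ProposalScorer(nn.Module):
    """Scores K trajectory proposals against a scene embedding (illustrative)."""

    def __init__(self, feat_dim: int, horizon: int):
        super().__init__()
        # Each proposal is a (horizon, 2) sequence of (x, y) waypoints.
        self.traj_enc = nn.Linear(horizon * 2, feat_dim)
        self.score_head = nn.Linear(feat_dim, 1)

    def forward(self, scene_feat, proposals):
        # scene_feat: (B, D) pooled scene embedding (assumed from V-JEPA)
        # proposals:  (B, K, T, 2) candidate trajectories
        B, K, T, _ = proposals.shape
        p = self.traj_enc(proposals.reshape(B, K, T * 2))       # (B, K, D)
        fused = p + scene_feat.unsqueeze(1)                     # add scene context
        return self.score_head(torch.tanh(fused)).squeeze(-1)   # (B, K) logits


def momentum_consistency(proposals, ego_vel, dt=0.5):
    # Penalize proposals whose first step deviates from the ego's current
    # velocity -- one *assumed* reading of "momentum-aware" selection.
    first_step = proposals[:, :, 0, :] / dt                     # (B, K, 2) implied velocity
    return -torch.norm(first_step - ego_vel.unsqueeze(1), dim=-1)  # higher = smoother


def distillation_targets(pdms_scores, momentum_bonus, beta=0.1, tau=0.5):
    # Soft targets over proposals: simulator score (e.g., PDMS) plus a
    # momentum-consistency bonus, softmax-normalized at temperature tau.
    return F.softmax((pdms_scores + beta * momentum_bonus) / tau, dim=-1)


# --- toy usage -------------------------------------------------------------
B, K, T, D = 4, 16, 8, 256
scorer = ProposalScorer(D, T)
scene_feat = torch.randn(B, D)        # pooled encoder output (placeholder)
proposals = torch.randn(B, K, T, 2)   # trajectory vocabulary; the human
                                      # trajectory can be appended as one candidate
pdms_scores = torch.rand(B, K)        # simulator-evaluated quality per proposal
ego_vel = torch.randn(B, 2)

logits = scorer(scene_feat, proposals)
targets = distillation_targets(pdms_scores, momentum_consistency(proposals, ego_vel))
loss = F.kl_div(F.log_softmax(logits, dim=-1), targets, reduction="batchmean")
loss.backward()

The sketch encodes the core idea the abstract describes: instead of regressing the single human trajectory, the planner is supervised with a soft distribution over K candidates, so simulator-scored alternatives contribute gradient alongside the human demonstration, and the momentum bonus down-weights proposals that would demand abrupt velocity changes.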