Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving

January 29, 2026
Authors: Linhan Wang, Zichong Yang, Chen Bai, Guoxiang Zhang, Xiaotong Liu, Xiaoyin Zheng, Xiao-Xiao Long, Chang-Tien Lu, Cheng Lu
cs.AI

Abstract

End-to-end autonomous driving increasingly leverages self-supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive-JEPA, a framework that integrates Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end driving. First, we adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal-centric planner that distills diverse simulator-generated trajectories alongside human trajectories, with a momentum-aware selection mechanism to promote stable and safe behavior. When evaluated on NAVSIM, the V-JEPA representation combined with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting. The complete Drive-JEPA framework achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting a new state-of-the-art.
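The pretraining step follows the V-JEPA recipe: an online encoder sees a partially masked clip while a predictor regresses the latent features that a slowly updated target encoder assigns to the masked patches. Below is a minimal PyTorch sketch of that formulation, assuming the standard recipe; the VideoEncoder, sizes, and mask ratio are illustrative stand-ins, not the authors' code.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoEncoder(nn.Module):
    """Toy stand-in for the ViT video encoder: 3D patchify + transformer."""
    def __init__(self, dim=256, depth=4, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv3d(3, dim, kernel_size=(1, patch, patch),
                                     stride=(1, patch, patch))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, video, keep=None):          # video: (B, 3, T, H, W)
        tok = self.patch_embed(video).flatten(2).transpose(1, 2)  # (B, N, D)
        if keep is not None:                      # hide masked patches from
            tok = tok * keep.unsqueeze(-1)        # the context stream
        return self.blocks(tok)

context_enc = VideoEncoder()
target_enc = copy.deepcopy(context_enc)           # EMA target, never trained
for p in target_enc.parameters():
    p.requires_grad_(False)
predictor = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(256, nhead=8, batch_first=True), 2)

def jepa_loss(video, mask_ratio=0.5):
    """Regress target features of masked tokens from the visible context."""
    with torch.no_grad():
        targets = target_enc(video)               # (B, N, D), full clip
    mask = torch.rand(targets.shape[:2]) < mask_ratio
    pred = predictor(context_enc(video, keep=~mask))
    return F.smooth_l1_loss(pred[mask], targets[mask])

@torch.no_grad()
def ema_update(m=0.999):                          # momentum target update
    for pt, pc in zip(target_enc.parameters(), context_enc.parameters()):
        pt.mul_(m).add_(pc, alpha=1 - m)

loss = jepa_loss(torch.randn(2, 3, 4, 64, 64))    # dummy clip batch
loss.backward()
ema_update()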
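The proposal-centric planner is described as distilling simulator-generated trajectories alongside the human one. A common way to train such multi-proposal heads is a winner-takes-all regression, sketched below as one plausible reading; the abstract does not give the exact loss, and distill_loss and the tensor shapes are hypothetical.

import torch

def distill_loss(proposals, targets):
    """proposals: (B, K, T, 2) predicted waypoint sets;
    targets: (B, M, T, 2) human + simulator-generated trajectories."""
    # Mean L2 distance between every proposal and every target trajectory.
    d = (proposals.unsqueeze(2) - targets.unsqueeze(1)).norm(dim=-1).mean(-1)
    # Each proposal is pulled toward its nearest target, so different
    # proposals can specialise on different behaviour modes.
    return d.min(dim=2).values.mean()

loss = distill_loss(torch.randn(2, 6, 8, 2), torch.randn(2, 4, 8, 2))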
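The abstract only names the momentum-aware selection mechanism. One plausible reading, sketched here, is to discount proposals whose implied initial velocity deviates from the ego vehicle's current motion; select_trajectory, lam, and dt are hypothetical.

import numpy as np

def select_trajectory(proposals, scores, ego_velocity, dt=0.5, lam=0.2):
    """proposals: (K, T, 2) waypoints in the ego frame; scores: (K,)
    planner scores; ego_velocity: (2,) current ego velocity."""
    implied_v0 = proposals[:, 0, :] / dt          # velocity to reach waypoint 1
    penalty = np.linalg.norm(implied_v0 - ego_velocity, axis=1)
    return int(np.argmax(scores - lam * penalty)) # favour momentum-consistent

idx = select_trajectory(np.random.randn(5, 8, 2), np.random.rand(5),
                        ego_velocity=np.array([4.0, 0.0]))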
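For the perception-free NAVSIM result, the abstract pairs the V-JEPA representation with a "simple transformer-based decoder". A minimal sketch of one such decoder, assuming learned queries cross-attend to the video tokens and regress future waypoints; all sizes are illustrative.

import torch
import torch.nn as nn

class TrajectoryDecoder(nn.Module):
    def __init__(self, dim=256, horizon=8, n_queries=1, depth=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, depth)
        self.head = nn.Linear(dim, horizon * 2)   # (x, y) per future step
        self.horizon = horizon

    def forward(self, video_tokens):              # (B, N, D) encoder features
        B = video_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        out = self.decoder(q, video_tokens)       # cross-attend to tokens
        return self.head(out).view(B, -1, self.horizon, 2)

tokens = torch.randn(2, 64, 256)                  # e.g. from the encoder above
waypoints = TrajectoryDecoder()(tokens)           # (2, 1, 8, 2)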