

VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

February 10, 2026
Authors: Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, Zhibo Chen
cs.AI

Abstract

Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is leakage-free state prediction: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation -- future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe -- JEPA pretraining followed by action-head fine-tuning -- without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in generalization and robustness over existing methods.
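
To make the leakage-free prediction idea concrete, below is a minimal sketch in the spirit of the abstract: the student pathway encodes only the current observation, the future frame passes through a frozen EMA target encoder used purely as a supervision target, and the loss is computed in latent space. The class name, MLP encoders over pre-extracted frame features, dimensions, and momentum value are illustrative assumptions, not the paper's actual implementation.

# Hedged sketch of JEPA-style leakage-free latent state prediction.
# All module and variable names here are hypothetical.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentPredictorJEPA(nn.Module):
    def __init__(self, feat_dim=512, latent_dim=256):
        super().__init__()
        # Student encoder: sees only the current observation.
        self.student = nn.Sequential(
            nn.Linear(feat_dim, latent_dim), nn.GELU(), nn.Linear(latent_dim, latent_dim))
        # Predictor: maps the current latent to a predicted future latent.
        self.predictor = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.GELU(), nn.Linear(latent_dim, latent_dim))
        # Target encoder: frozen EMA copy of the student, applied to future frames only.
        self.target = copy.deepcopy(self.student)
        for p in self.target.parameters():
            p.requires_grad = False

    @torch.no_grad()
    def update_target(self, momentum=0.996):
        # EMA update keeps the target encoder a slow-moving copy of the student.
        for ps, pt in zip(self.student.parameters(), self.target.parameters()):
            pt.mul_(momentum).add_(ps.detach(), alpha=1.0 - momentum)

    def forward(self, obs_now, obs_future):
        # Future frames pass only through the frozen target encoder (supervision),
        # never through the student pathway: the leakage-free constraint.
        with torch.no_grad():
            z_future = self.target(obs_future)
        z_now = self.student(obs_now)
        z_pred = self.predictor(z_now)
        # Predict in latent space rather than pixel space.
        return F.mse_loss(z_pred, z_future)

# Toy usage with random features standing in for frame embeddings.
model = LatentPredictorJEPA()
loss = model(torch.randn(8, 512), torch.randn(8, 512))
loss.backward()
model.update_target()

In the two-stage recipe the abstract describes, a pretraining loop of this form would be followed by attaching and fine-tuning an action head on the learned student representations.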