Decoupled Robot Learning via Separate Forward and Inverse Dynamics Pretraining

Abstract

Vision-language-action (VLA) models have shown great potential for building generalist robots, but they still face a dilemma: a misalignment between 2D image forecasting and 3D action prediction. Moreover, this vision-action entangled training scheme limits a model's ability to learn from large-scale, action-free web video data. To address these issues, we propose DeFI, a novel framework that Decouples visual Forward and Inverse dynamics pretraining so that each can exploit its own data source, disentangling video generation from action prediction. We introduce the General Forward Dynamics Model (GFDM), pretrained on diverse human and robot videos for future-frame prediction, and the General Inverse Dynamics Model (GIDM), trained via self-supervised learning to infer latent actions from unlabeled video transitions. The two models are then integrated into a unified architecture for end-to-end finetuning on downstream tasks. In this way, GFDM and GIDM first play to their individual strengths and then cooperate for mutual benefit. Extensive experiments on CALVIN ABC-D and SimplerEnv demonstrate state-of-the-art performance: DeFI achieves an average task length of 4.51 on CALVIN, a 51.2% success rate on the SimplerEnv-Fractal benchmark, and an 81.3% success rate in real-world deployment, significantly outperforming prior methods.