RynnVLA-002: A Unified Vision-Language-Action and World Model
November 21, 2025
Authors: Jun Cen, Siteng Huang, Yuqian Yuan, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Kehan Li, Hao Luo, Fan Wang, Xin Li, Deli Zhao, Hao Chen
cs.AI
Abstract
We introduce RynnVLA-002, a unified Vision-Language-Action (VLA) and world model. The world model leverages action and visual inputs to predict future image states, learning the underlying physics of the environment to refine action generation. Conversely, the VLA model produces subsequent actions from image observations, enhancing visual understanding and in turn supporting the world model's image generation. This unified framework enables joint learning of environmental dynamics and action planning. Our experiments show that RynnVLA-002 outperforms standalone VLA and world models, demonstrating that the two components mutually enhance each other. We evaluate RynnVLA-002 on both simulation and real-world robot tasks: it achieves a 97.4% success rate on the LIBERO simulation benchmark without pretraining, and in real-world LeRobot experiments its integrated world model boosts the overall success rate by 50%.
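To make the described coupling concrete, below is a minimal training-step sketch: a shared backbone feeds two heads, an action head (the VLA role) and a next-observation head conditioned on the action (the world-model role), and both objectives update the shared representation under a joint loss. All module names, dimensions, and the loss form here are illustrative assumptions, not RynnVLA-002's actual architecture.

import torch
import torch.nn as nn

class UnifiedVLAWorldModel(nn.Module):
    """Toy unified model: one backbone, a VLA head and a world-model head."""

    def __init__(self, obs_dim=512, act_dim=7, hidden=256):
        super().__init__()
        # Shared backbone: encodes (precomputed) image observation features.
        self.backbone = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # VLA role: predict the next action from the current observation.
        self.action_head = nn.Linear(hidden, act_dim)
        # World-model role: predict the next observation from the current
        # observation features together with the taken action.
        self.world_head = nn.Sequential(
            nn.Linear(hidden + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, obs_feat, action):
        h = self.backbone(obs_feat)
        pred_action = self.action_head(h)
        pred_next_obs = self.world_head(torch.cat([h, action], dim=-1))
        return pred_action, pred_next_obs

# One joint training step on dummy data. The demonstrated action serves both
# as the VLA target and as the world model's conditioning input, so gradients
# from both losses flow into the shared backbone -- the mechanism by which
# the two roles can reinforce each other.
model = UnifiedVLAWorldModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
obs = torch.randn(8, 512)       # current observation features
act = torch.randn(8, 7)         # demonstrated action
next_obs = torch.randn(8, 512)  # next observation features
pred_act, pred_next = model(obs, act)
loss = (nn.functional.mse_loss(pred_act, act)
        + nn.functional.mse_loss(pred_next, next_obs))
opt.zero_grad()
loss.backward()
opt.step()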