Evaluating Gemini Robotics Policies in a Veo World Simulator
December 11, 2025
Authors: Gemini Robotics Team, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao, Abhishek Jindal, Thomas Kipf, Sean Kirmani, Fangchen Liu, Anirudha Majumdar, Andrew Marmon, Carolina Parada, Yulia Rubanova, Dhruv Shah, Vikas Sindhwani, Jie Tan, Fei Xia, Ted Xiao, Sherry Yang, Wenhao Yu, Allan Zhou
cs.AI
Abstract
Generative world models hold significant potential for simulating interactions with visuomotor policies in varied environments. Frontier video models can generate realistic observations and environment interactions in a scalable and general manner. However, the use of video models in robotics has been limited primarily to in-distribution evaluations, i.e., scenarios similar to those used to train the policy or fine-tune the base video model. In this report, we demonstrate that video models can be used for the entire spectrum of policy evaluation use cases in robotics: from assessing nominal performance, to testing out-of-distribution (OOD) generalization, to probing physical and semantic safety. We introduce a generative evaluation system built upon a frontier video foundation model (Veo). The system is optimized to support robot action conditioning and multi-view consistency, and it integrates generative image editing and multi-view completion to synthesize realistic variations of real-world scenes along multiple axes of generalization. We demonstrate that the system preserves the base capabilities of the video model, enabling accurate simulation of scenes that have been edited to include novel interaction objects, novel visual backgrounds, and novel distractor objects. This fidelity makes it possible to accurately predict the relative performance of different policies in both nominal and OOD conditions, to determine the relative impact of different axes of generalization on policy performance, and to red team policies to expose behaviors that violate physical or semantic safety constraints. We validate these capabilities through 1600+ real-world evaluations spanning eight Gemini Robotics policy checkpoints and five tasks for a bimanual manipulator.
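The claim that simulated evaluations predict the relative performance of policies can be made concrete with a small sketch. The Python below is an illustration only, not the report's evaluation code: it assumes each policy checkpoint has one success rate from the video-model simulator and one from real-world trials, and it compares them with a Pearson correlation and a pairwise ranking agreement score. The function names, the choice of metrics, and all numbers are hypothetical placeholders rather than values or methods taken from the report.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pairwise_rank_agreement(sim, real):
    """Fraction of policy pairs that both evaluations order the same way.

    Pairs that are tied in the real-world scores are skipped.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(sim)), 2):
        if real[i] == real[j]:
            continue
        total += 1
        if (sim[i] - sim[j]) * (real[i] - real[j]) > 0:
            agree += 1
    return agree / total if total else float("nan")


# Hypothetical success rates for four policy checkpoints.
# These are placeholders for illustration, NOT results from the report.
sim_success = [0.20, 0.45, 0.50, 0.80]
real_success = [0.25, 0.40, 0.60, 0.75]

print(f"Pearson r: {pearson(sim_success, real_success):.3f}")
print(f"Pairwise rank agreement: {pairwise_rank_agreement(sim_success, real_success):.2f}")
```

A high correlation indicates that simulated success rates track real ones, while the pairwise score checks only whether the simulator orders policies correctly, which is the property that matters when choosing between checkpoints.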