Evaluating Gemini Robotics Policies in a Veo World Simulator

December 11, 2025
Authors: Gemini Robotics Team, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao, Abhishek Jindal, Thomas Kipf, Sean Kirmani, Fangchen Liu, Anirudha Majumdar, Andrew Marmon, Carolina Parada, Yulia Rubanova, Dhruv Shah, Vikas Sindhwani, Jie Tan, Fei Xia, Ted Xiao, Sherry Yang, Wenhao Yu, Allan Zhou
cs.AI

Abstract

Generative world models hold significant potential for simulating interactions with visuomotor policies in varied environments. Frontier video models can enable generation of realistic observations and environment interactions in a scalable and general manner. However, the use of video models in robotics has been limited primarily to in-distribution evaluations, i.e., scenarios that are similar to ones used to train the policy or fine-tune the base video model. In this report, we demonstrate that video models can be used for the entire spectrum of policy evaluation use cases in robotics: from assessing nominal performance to out-of-distribution (OOD) generalization, and probing physical and semantic safety. We introduce a generative evaluation system built upon a frontier video foundation model (Veo). The system is optimized to support robot action conditioning and multi-view consistency, while integrating generative image-editing and multi-view completion to synthesize realistic variations of real-world scenes along multiple axes of generalization. We demonstrate that the system preserves the base capabilities of the video model to enable accurate simulation of scenes that have been edited to include novel interaction objects, novel visual backgrounds, and novel distractor objects. This fidelity enables accurately predicting the relative performance of different policies in both nominal and OOD conditions, determining the relative impact of different axes of generalization on policy performance, and performing red teaming of policies to expose behaviors that violate physical or semantic safety constraints. We validate these capabilities through 1600+ real-world evaluations of eight Gemini Robotics policy checkpoints and five tasks for a bimanual manipulator.