Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Abstract

We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.
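To make the flow-matching idea mentioned in the abstract concrete, the following is a minimal sketch of how a flow-matching decoder can turn noise into an action vector by integrating a velocity field over time. It is not GE-Act's actual decoder: the functions `toy_velocity` and `decode_actions`, the 3-dimensional action, and the closed-form velocity (which assumes the target action is already known) are hypothetical stand-ins for a learned, latent-conditioned network.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity
    that carries x_t to a known endpoint x_1 is (x_1 - x_t) / (1 - t).
    A real decoder would predict this from latent features instead.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def decode_actions(target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 3-dimensional action, for illustration only.
target_action = [0.25, -0.5, 1.0]
actions = decode_actions(target_action)
```

Because the toy velocity field follows a straight path, the Euler integration ends at the target action up to floating-point error. With a learned velocity network the endpoint would only approximate the intended action, and the number of integration steps becomes a trade-off between accuracy and speed.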
