Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Abstract

We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.