Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Abstract

We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.
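The abstract does not spell out how the flow-matching decoder in GE-Act works. As a rough illustration of the general technique only, the sketch below integrates a velocity field from a noise sample to an action vector using Euler steps. The names `euler_flow_decode` and `make_toy_velocity` are hypothetical. The toy velocity field is an assumption that stands in for a learned network conditioned on latent video features. None of this is taken from the GE-Act implementation.

```python
from typing import Callable, List

Vector = List[float]


def euler_flow_decode(
    noise: Vector,
    velocity: Callable[[Vector, float], Vector],
    num_steps: int = 10,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical stand-in for a learned velocity network.

    Returns the exact velocity of the straight-line path
    x_t = (1 - t) * noise + t * target, which is (target - x_t) / (1 - t).
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

    return velocity


# Toy usage: decode a 3-dimensional "action" from a fixed noise sample.
target_action = [0.5, -0.2, 0.1]
decoded = euler_flow_decode(
    [1.0, 1.0, 1.0], make_toy_velocity(target_action), num_steps=8
)
```

In a real system the velocity would come from a trained model rather than a closed-form expression, and the number of integration steps trades inference latency against accuracy.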
