DreamX-World 1.0: A General-Purpose Interactive World Model

Abstract

DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model for controllable long-horizon generation. It supports camera navigation, revisits to previously observed regions, and promptable events across photorealistic, game-style, and stylized domains. Our data engine combines camera-accurate Unreal Engine rendering, action-rich gameplay recordings, and real-world videos with recovered camera geometry. For camera control, we introduce E-PRoPE, a lightweight variant of projective positional encoding that retains PRoPE's projective camera geometry while applying camera-aware attention to spatially reduced tokens. We convert a bidirectional video generator into a few-step autoregressive world model using causal forcing, DMD-style distillation, and long-rollout training. Training on self-generated long-horizon contexts exposes the model to its own generated history and reduces the style and color drift that accumulates across autoregressive chunks. Memory-Conditioned Scene Persistence retrieves earlier views through camera-geometry-based retrieval, while residual recycling makes the conditioning path less sensitive to imperfect memory latents. Event Instruction Tuning adds composable event control, and reinforcement learning alignment recovers camera control and visual quality after distillation. With mixed-precision DiT execution, residual reuse, 75%-pruned VAE decoding, and asynchronous pipeline parallelism, DreamX-World 1.0 reaches up to 16 FPS on eight RTX 5090 GPUs. On our 5-second basic evaluation, DreamX-World 1.0 achieves a camera-control score of 73.75 and an overall score of 84.76, outperforming HY-WorldPlay 1.5 (80.79) and LingBot-World (80.45) in overall score.
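
The abstract does not specify how camera-geometry-based retrieval scores earlier views, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes each stored view carries a camera position and a unit viewing direction, and it ranks stored views by a hypothetical score that multiplies spatial proximity by viewing-direction alignment. The names `CameraView`, `view_similarity`, `retrieve_memory`, and the `distance_scale` parameter are all assumptions introduced for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class CameraView:
    """A stored view with its camera pose (hypothetical structure)."""
    frame_index: int
    position: tuple[float, float, float]  # camera center in world coordinates
    forward: tuple[float, float, float]   # unit viewing direction


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def view_similarity(query, stored, distance_scale=1.0):
    """Hypothetical score in [0, 1]: high when the cameras are close
    together and look in a similar direction."""
    dist = math.dist(query.position, stored.position)
    proximity = math.exp(-dist / distance_scale)
    # Views facing away from the query direction contribute nothing.
    alignment = max(0.0, _dot(query.forward, stored.forward))
    return proximity * alignment


def retrieve_memory(query, memory, k=2):
    """Return the k stored views with the highest similarity to the query."""
    ranked = sorted(memory, key=lambda v: view_similarity(query, v), reverse=True)
    return ranked[:k]


# Toy example: four earlier views and a query pose near the first one.
memory = [
    CameraView(0, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)),
    CameraView(1, (5.0, 0.0, 0.0), (0.0, 0.0, 1.0)),   # far away
    CameraView(2, (0.5, 0.0, 0.0), (0.0, 0.0, -1.0)),  # close, facing backward
    CameraView(3, (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)),
]
query = CameraView(10, (0.2, 0.0, 0.0), (0.0, 0.0, 1.0))
retrieved = [v.frame_index for v in retrieve_memory(query, memory, k=2)]
```

In this toy setup the retrieved views are the nearby ones that face the same way as the query, while the distant view and the backward-facing view are ranked last. A real system would more plausibly measure frustum or field-of-view overlap using the full camera intrinsics and extrinsics; the pose-only score here is a deliberate simplification.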