Dreamland: Controllable World Creation with Simulator and Generative Models

Abstract

Large-scale video generative models can synthesize diverse and realistic visual content for dynamic world creation, but they often lack element-wise controllability, which hinders their use in scene editing and in training embodied AI agents. We propose Dreamland, a hybrid world generation framework that combines the granular control of a physics-based simulator with the photorealistic output of large-scale pretrained generative models. In particular, we design a layered world abstraction that encodes both pixel-level and object-level semantics and geometry as an intermediate representation bridging the simulator and the generative model. This approach enhances controllability, minimizes adaptation cost through early alignment with real-world distributions, and supports off-the-shelf use of existing and future pretrained generative models. We further construct the D3Sim dataset to facilitate the training and evaluation of hybrid generation pipelines. Experiments demonstrate that Dreamland outperforms existing baselines, improving image quality by 50.8% and controllability by 17.9%, and show that it has strong potential to enhance embodied agent training. Code and data will be made available.
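
To make the idea of a layered world abstraction more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a scene is described by object-level entries (category, bounding box, depth) that are rasterized into pixel-level semantic and depth maps, which a generative model could then take as conditioning. All names (`SceneObject`, `LayeredWorldAbstraction`, `render_layers`) are hypothetical, and the bounding-box rasterization stands in for whatever renderer a real simulator would provide.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneObject:
    """Object-level layer entry (hypothetical fields, for illustration only)."""
    category: str
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1), upper bounds exclusive
    depth_m: float                  # distance from the camera in metres


@dataclass
class LayeredWorldAbstraction:
    """Holds object-level entries and derives pixel-level maps from them."""
    height: int
    width: int
    objects: List[SceneObject] = field(default_factory=list)

    def render_layers(self, background="road", far=float("inf")):
        """Rasterize objects into semantic and depth maps; the nearest object wins."""
        semantics = [[background] * self.width for _ in range(self.height)]
        depth = [[far] * self.width for _ in range(self.height)]
        for obj in self.objects:
            x0, y0, x1, y1 = obj.box
            # Clip the box to the image so out-of-frame objects are handled safely.
            for y in range(max(y0, 0), min(y1, self.height)):
                for x in range(max(x0, 0), min(x1, self.width)):
                    if obj.depth_m < depth[y][x]:
                        semantics[y][x] = obj.category
                        depth[y][x] = obj.depth_m
        return semantics, depth


scene = LayeredWorldAbstraction(
    height=4,
    width=6,
    objects=[
        SceneObject("car", (1, 1, 4, 3), depth_m=12.0),
        SceneObject("pedestrian", (3, 0, 5, 2), depth_m=5.0),
    ],
)
semantics, depth = scene.render_layers()
```

In this sketch, element-wise control amounts to editing one entry in `objects` and re-rendering: only the pixels covered by that object change in the semantic and depth maps, while the rest of the conditioning stays fixed.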