Dreamland: Controllable World Creation with Simulator and Generative Models
June 9, 2025
Authors: Sicheng Mo, Ziyang Leng, Leon Liu, Weizhen Wang, Honglin He, Bolei Zhou
cs.AI
Abstract
Large-scale video generative models can synthesize diverse and realistic
visual content for dynamic world creation, but they often lack element-wise
controllability, hindering their use in editing scenes and training embodied AI
agents. We propose Dreamland, a hybrid world generation framework that
combines the granular control of a physics-based simulator with the
photorealistic content output of large-scale pretrained generative models. In
particular, we design a layered world abstraction that encodes both
pixel-level and object-level semantics and geometry as an intermediate
representation to bridge the simulator and the generative model. This approach
enhances controllability, minimizes adaptation cost through early alignment
with real-world distributions, and supports off-the-shelf use of existing and
future pretrained generative models. We further construct the D3Sim dataset to
facilitate the training and evaluation of hybrid generation pipelines.
Experiments demonstrate that Dreamland outperforms existing baselines,
improving image quality by 50.8% and controllability by 17.9%, and shows great
potential for enhancing embodied agent training. Code and data will be made
available.
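The abstract does not specify how the layered world abstraction is laid out in practice. As a purely hypothetical sketch (all names and shapes below are assumptions, not the paper's actual interface), such an intermediate representation might pair pixel-level maps rendered by the simulator with an object-level list of controllable scene elements, then stack the pixel layers into a conditioning tensor for a pretrained generative model:

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class SceneObject:
    """Object-level entry: one per controllable scene element (illustrative)."""
    category: str        # e.g. "car", "pedestrian"
    bbox_3d: np.ndarray  # (8, 3) box corners in world coordinates
    instance_id: int


@dataclass
class LayeredWorldState:
    """Hypothetical intermediate representation bridging simulator and
    generative model: pixel-level layers plus an object-level list."""
    semantic_map: np.ndarray  # (H, W) per-pixel class IDs from the simulator
    depth_map: np.ndarray     # (H, W) metric depth from the simulator
    objects: list = field(default_factory=list)

    def conditioning(self) -> np.ndarray:
        """Stack the pixel-level layers into one (C, H, W) tensor that a
        pretrained image/video model could be conditioned on."""
        return np.stack(
            [self.semantic_map.astype(np.float32),
             self.depth_map.astype(np.float32)],
            axis=0,
        )


# Toy 4x4 scene with a single object.
state = LayeredWorldState(
    semantic_map=np.zeros((4, 4), dtype=np.int32),
    depth_map=np.full((4, 4), 10.0),
    objects=[SceneObject("car", np.zeros((8, 3)), instance_id=1)],
)
print(state.conditioning().shape)  # (2, 4, 4)
```

Editing a scene element then amounts to modifying one `SceneObject` and re-rendering the pixel layers, while the downstream generative model is swapped freely, consistent with the "plug-and-play" claim.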