The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

Abstract

We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories, which encode motion, timing, and visibility, with natural language for semantic intent and reference images for visual grounding of object identity. This enables the generation of coherent, controllable events that include multi-agent interactions, object entry and exit, reference-guided appearance, and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene stability even when objects temporarily disappear. By supporting expressive world-event generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.