The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
December 18, 2025
作者: Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng, Wen Wang, Ka Leong Cheng, Shuailei Ma, Qingyan Bai, Yixuan Li, Cheng Chen, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen
cs.AI
Abstract
We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance, and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene stability despite temporary disappearances. By supporting expressive world-event generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.