The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

Abstract

We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories (which encode motion, timing, and visibility) with natural language for semantic intent and reference images for visual grounding of object identity. This enables the generation of coherent, controllable events, including multi-agent interactions, object entry and exit, reference-guided appearance, and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and the scene even when objects temporarily disappear. By supporting expressive world event generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at https://worldcanvas.github.io/
