minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

Abstract

Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable, causal, and low-latency rollout, which in practice demands a full pipeline spanning data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference. In this work, we present minWM, a full-stack open-source framework for building real-time interactive video world models. minWM provides an end-to-end pipeline that converts existing bidirectional T2V/TI2V video foundation models into camera-controllable few-step autoregressive world models. Specifically, minWM first fine-tunes a bidirectional video diffusion model with camera control, and then applies the Causal Forcing / Causal Forcing++ pipeline, including AR diffusion training, causal ODE or causal consistency distillation, and asymmetric DMD, to distill it into a few-step autoregressive generator for low-latency rollout. The framework is modular and architecture-extensible: we instantiate it on representative open backbones, including Wan2.1-T2V-1.3B and HY1.5-TI2V-8B, covering both cross-attention-based condition injection and MMDiT-style architectures. minWM also supports adapting existing video world models, such as HY-WorldPlay, to new data distributions, training recipes, and latency targets. Beyond releasing runnable scripts, checkpoints, documentation, and inference code, we provide practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements. We hope minWM serves as a reproducible and extensible recipe for building and adapting real-time interactive video world models. Project Page: [https://github.com/shengshu-ai/minWM](https://github.com/shengshu-ai/minWM)
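
The stage order described above can be written down as a minimal sketch. The stage names, the `Stage` class, and `build_pipeline` are hypothetical and only mirror the wording of the abstract; they are not minWM's actual scripts or API. The `causal` flag reflects the abstract's statement that the first stage fine-tunes a bidirectional model, and the assumption that the later stages operate on an autoregressive one.

```python
from dataclasses import dataclass


# Hypothetical stage record: names follow the abstract, not minWM's real scripts.
@dataclass(frozen=True)
class Stage:
    name: str
    causal: bool  # assumption: False while the trained model is still bidirectional


def build_pipeline(distill: str = "ode") -> list[Stage]:
    """Return the stage order described in the abstract.

    `distill` picks the intermediate distillation variant: "ode" for causal
    ODE distillation, "consistency" for causal consistency distillation.
    """
    variants = {
        "ode": "causal_ode_distillation",
        "consistency": "causal_consistency_distillation",
    }
    if distill not in variants:
        raise ValueError(f"unknown distillation variant: {distill!r}")
    return [
        Stage("camera_control_finetune", causal=False),  # bidirectional model
        Stage("ar_diffusion_training", causal=True),
        Stage(variants[distill], causal=True),
        Stage("asymmetric_dmd", causal=True),
    ]


if __name__ == "__main__":
    # Print the stage order for the consistency-distillation variant.
    for index, stage in enumerate(build_pipeline("consistency"), start=1):
        print(index, stage.name)
```

The sketch encodes only the ordering and the single either/or choice the abstract mentions (causal ODE or causal consistency distillation); data construction and streaming inference sit before and after these stages and are left out.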