PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation

July 22, 2025
作者: Yaofang Liu, Yumeng Ren, Aitor Artola, Yuxuan Hu, Xiaodong Cun, Xiaotong Zhao, Alan Zhao, Raymond H. Chan, Suiyun Zhang, Rui Liu, Dandan Tu, Jean-Michel Morel
cs.AI

Abstract

The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa, a groundbreaking paradigm that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Moreover, VTA is a non-destructive adaptation: it fully preserves the capabilities of the base model. By fine-tuning the SOTA Wan2.1-T2V-14B model with VTA, we achieve unprecedented efficiency, surpassing the performance of Wan-I2V-14B with ≤ 1/200 of the training cost ($500 vs. ≥ $100,000) and ≤ 1/2500 of the dataset size (4K vs. ≥ 10M samples). Pusa not only sets a new standard for image-to-video (I2V) generation, achieving a VBench-I2V total score of 87.32% (vs. 86.86% for Wan-I2V-14B), but also unlocks many zero-shot multi-task capabilities such as start-end frames and video extension, all without task-specific training. Meanwhile, Pusa can still perform text-to-video generation. Mechanistic analyses reveal that our approach preserves the foundation model's generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to vectorized timesteps. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike. Code is open-sourced at https://github.com/Yaofang-Liu/Pusa-VidGen.
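
To make the core idea concrete, the sketch below contrasts a conventional shared scalar timestep with a per-frame vectorized timestep in a toy denoising call. This is a minimal illustration under stated assumptions, not the released Pusa code: the tensor layout and the `toy_denoiser` stand-in are hypothetical and used purely for exposition.

```python
# Toy sketch (not the authors' implementation): scalar vs. vectorized timesteps.
import torch

def toy_denoiser(latents: torch.Tensor, timesteps: torch.Tensor) -> torch.Tensor:
    """Stand-in for a video diffusion backbone.

    latents:   (frames, channels, height, width)
    timesteps: (frames,) -- one noise level per frame.
    """
    # Broadcast each frame's timestep over its latent and apply a mock update.
    scale = (timesteps / 1000.0).view(-1, 1, 1, 1)
    return latents * (1.0 - scale)

frames, channels, height, width = 8, 4, 32, 32
latents = torch.randn(frames, channels, height, width)

# Conventional scalar timestep: every frame shares one noise level,
# so all frames are forced to evolve in lockstep.
t_scalar = torch.full((frames,), 800.0)
out_sync = toy_denoiser(latents, t_scalar)

# Vectorized timesteps: each frame can sit at its own noise level.
# For example, a timestep of 0 for the first frame keeps it clean,
# treating it as a conditioning image rather than noise to be removed.
t_vector = torch.linspace(0.0, 800.0, frames)
out_async = toy_denoiser(latents, t_vector)
```

Under this view, the zero-shot behaviors described in the abstract (I2V, start-end frames, video extension) amount to choosing which frames are held at low or zero noise while the rest are denoised, with no task-specific training.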