PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation

July 22, 2025
Authors: Yaofang Liu, Yumeng Ren, Aitor Artola, Yuxuan Hu, Xiaodong Cun, Xiaotong Zhao, Alan Zhao, Raymond H. Chan, Suiyun Zhang, Rui Liu, Dandan Tu, Jean-Michel Morel
cs.AI

Abstract

The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa, a groundbreaking paradigm that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Moreover, VTA is a non-destructive adaptation: it fully preserves the capabilities of the base model. By finetuning the SOTA Wan2.1-T2V-14B model with VTA, we achieve unprecedented efficiency -- surpassing the performance of Wan-I2V-14B with ≤ 1/200 of the training cost ($500 vs. ≥ $100,000) and ≤ 1/2500 of the dataset size (4K vs. ≥ 10M samples). Pusa not only sets a new standard for image-to-video (I2V) generation, achieving a VBench-I2V total score of 87.32% (vs. 86.86% for Wan-I2V-14B), but also unlocks many zero-shot multi-task capabilities such as start-end frame control and video extension -- all without task-specific training. Meanwhile, Pusa can still perform text-to-video generation. Mechanistic analyses reveal that our approach preserves the foundation model's generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to vectorized timesteps. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike. Code is open-sourced at https://github.com/Yaofang-Liu/Pusa-VidGen.
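
To make the core idea concrete, here is a minimal, illustrative sketch (not the official Pusa or Wan2.1 implementation) of how a vectorized timestep differs from a conventional scalar one: each frame carries its own noise level, so pinning a conditioning frame to t = 0 mimics zero-shot image-to-video, while setting every entry equal recovers the standard synchronized case. All names and shapes below (ToyVideoDenoiser, make_timesteps, latent sizes) are assumptions chosen for illustration.

```python
# Minimal sketch of vectorized timesteps (illustrative only, not the Pusa codebase).
import torch
import torch.nn as nn


class ToyVideoDenoiser(nn.Module):
    """Stand-in for a video diffusion backbone that accepts one timestep per frame."""

    def __init__(self, channels: int = 8, embed_dim: int = 32):
        super().__init__()
        # Embed each frame's timestep independently, then add it to that frame's latents.
        self.time_embed = nn.Sequential(
            nn.Linear(1, embed_dim), nn.SiLU(), nn.Linear(embed_dim, channels)
        )
        self.mix = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x: (B, C, F, H, W) video latents; t: (B, F) per-frame timesteps in [0, 1].
        emb = self.time_embed(t.unsqueeze(-1))        # (B, F, C)
        emb = emb.permute(0, 2, 1)[..., None, None]   # (B, C, F, 1, 1), broadcast over H, W
        return self.mix(x + emb)                      # toy prediction (noise/velocity)


def make_timesteps(batch: int, frames: int, t: float, cond_frames: int = 0) -> torch.Tensor:
    """Build a vectorized timestep: conditioning frames stay clean (t = 0), the rest share t.

    cond_frames=0 degenerates to the conventional scalar-timestep case (all frames
    synchronized); cond_frames=1 mimics zero-shot image-to-video conditioning.
    """
    ts = torch.full((batch, frames), t)
    ts[:, :cond_frames] = 0.0
    return ts


if __name__ == "__main__":
    B, C, F, H, W = 1, 8, 5, 16, 16
    model = ToyVideoDenoiser(channels=C)
    x = torch.randn(B, C, F, H, W)

    t_sync = make_timesteps(B, F, t=0.7)                # all frames at t = 0.7 (scalar-like)
    t_i2v = make_timesteps(B, F, t=0.7, cond_frames=1)  # first frame pinned to t = 0

    print(model(x, t_sync).shape)  # torch.Size([1, 8, 5, 16, 16])
    print(model(x, t_i2v).shape)
```

In the same spirit, start-end frame control would pin both boundary frames to t = 0, and video extension would pin previously generated frames, which is consistent with the zero-shot capabilities described in the abstract.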