ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer
March 16, 2026
Authors: Ruonan Yu, Zhenxiong Tan, Zigeng Chen, Songhua Liu, Xinchao Wang
cs.AI
Abstract
Diffusion Transformers (DiTs) have demonstrated remarkable scalability and quality in image and video generation, prompting growing interest in extending them to controllable generation and editing tasks. However, compared to their image counterparts, progress in video control and editing remains limited, mainly due to the scarcity of paired video data and the high computational cost of training video diffusion models. To address this issue, we propose ViFeEdit, a video-free tuning framework for video diffusion transformers. Without requiring any form of video training data, ViFeEdit achieves versatile video generation and editing, adapted solely with 2D images. At the core of our approach is an architectural reparameterization that decouples spatially independent computation from the full 3D attention in modern video diffusion transformers, enabling visually faithful editing while maintaining temporal consistency with only minimal additional parameters. Moreover, this design operates as a dual-path pipeline with separate timestep embeddings for noise scheduling, exhibiting strong adaptability to diverse conditioning signals. Extensive experiments demonstrate that, with only minimal training on 2D image data, our method delivers promising results in controllable video generation and editing. Code is available at https://github.com/Lexie-YU/ViFeEdit.
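To give intuition for the reparameterization described above, the sketch below (a minimal NumPy illustration, not the paper's actual implementation) shows one way full 3D attention over all video tokens can be restricted to spatially independent computation: masking the attention matrix so each token attends only within its own frame is numerically equivalent to running 2D image attention on each frame separately, which is what makes adaptation with 2D images possible. The names `attention` and the frame-local mask construction are illustrative assumptions, not code from the ViFeEdit repository.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # standard scaled dot-product attention; mask=True marks allowed pairs
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

# a toy video: T frames, S spatial tokens per frame, d channels,
# flattened into a single (T*S, d) token sequence as in a video DiT
T, S, d = 2, 3, 4
rng = np.random.default_rng(0)
q = rng.standard_normal((T * S, d))
k = rng.standard_normal((T * S, d))
v = rng.standard_normal((T * S, d))

# frame-local mask: token i may attend to token j only if they
# belong to the same frame (spatially independent 3D attention)
frame = np.repeat(np.arange(T), S)
mask = frame[:, None] == frame[None, :]
out_masked = attention(q, k, v, mask)

# equivalent: plain 2D attention applied to each frame on its own
out_per_frame = np.concatenate([
    attention(q[f * S:(f + 1) * S], k[f * S:(f + 1) * S], v[f * S:(f + 1) * S])
    for f in range(T)
])
assert np.allclose(out_masked, out_per_frame)
```

Because the masked 3D attention coincides with per-frame 2D attention, weights tuned on single images remain valid for the spatial path, while the unmasked temporal interactions can be reintroduced at inference to preserve temporal consistency.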