
Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models

June 1, 2025
作者: Kinam Kim, Junha Hyung, Jaegul Choo
cs.AI

Abstract

Recent advances in text-to-video diffusion models have enabled high-quality video synthesis, but controllable generation remains challenging, particularly under limited data and compute. Existing fine-tuning methods for conditional generation often rely on external encoders or architectural modifications, which demand large datasets and are typically restricted to spatially aligned conditioning, limiting flexibility and scalability. In this work, we introduce Temporal In-Context Fine-Tuning (TIC-FT), an efficient and versatile approach for adapting pretrained video diffusion models to diverse conditional generation tasks. Our key idea is to concatenate condition and target frames along the temporal axis and insert intermediate buffer frames with progressively increasing noise levels. These buffer frames enable smooth transitions, aligning the fine-tuning process with the pretrained model's temporal dynamics. TIC-FT requires no architectural changes and achieves strong performance with as few as 10-30 training samples. We validate our method across a range of tasks, including image-to-video and video-to-video generation, using large-scale base models such as CogVideoX-5B and Wan-14B. Extensive experiments show that TIC-FT outperforms existing baselines in both condition fidelity and visual quality, while remaining highly efficient in both training and inference. For additional results, visit https://kinam0252.github.io/TIC-FT/
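The core construction described above — clean condition frames, buffer frames with progressively increasing noise, and fully noised target frames, all concatenated along the temporal axis — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the linear noise-mixing rule, and the choice to fill buffers with the last condition frame are all assumptions.

```python
import numpy as np

def build_tic_ft_sequence(cond, target, n_buffer=4, rng=None):
    """Illustrative sketch of TIC-FT's temporal concatenation.

    cond:   (F_c, H, W, C) clean condition frames (noise level 0)
    target: (F_t, H, W, C) target frames (fully noised during training)

    Buffer frames sit between condition and target with linearly
    increasing noise, smoothing the transition along the temporal axis.
    The linear mix and buffer content are assumptions for illustration,
    not the paper's exact formulation.
    """
    rng = rng or np.random.default_rng(0)
    # Buffer noise levels ramp strictly between 0 (clean) and 1 (pure noise).
    levels = np.linspace(0.0, 1.0, n_buffer + 2)[1:-1]
    # Assumption: buffers reuse the last condition frame as their content.
    base = np.repeat(cond[-1:], n_buffer, axis=0)
    t = levels[:, None, None, None]
    buffers = (1.0 - t) * base + t * rng.standard_normal(base.shape)
    # Target frames start from pure noise, as in standard diffusion training.
    noised_target = rng.standard_normal(target.shape)
    # Concatenate along the temporal (frame) axis; no architectural change
    # is needed because the model already processes frame sequences.
    return np.concatenate([cond, buffers, noised_target], axis=0)
```

For example, with 2 condition frames, 4 buffer frames, and 3 target frames, the result is a single 9-frame sequence whose leading frames are the untouched conditions — mirroring how the method reuses the pretrained model's temporal dynamics instead of adding an external encoder.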