

LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation

October 16, 2023
Authors: Ruiqi Wu, Liangyu Chen, Tong Yang, Chunle Guo, Chongyi Li, Xiangyu Zhang
cs.AI

Abstract

With the impressive progress in diffusion-based text-to-image generation, extending such powerful generative ability to text-to-video has attracted enormous attention. Existing methods either require large-scale text-video paired data and substantial training resources or learn motions that are precisely aligned with template videos. It is non-trivial to balance the trade-off between the degree of generation freedom and the resource costs for video generation. In our study, we present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with 8~16 videos on a single GPU. Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation, so that our tuned video diffusion model mainly focuses on motion learning. The well-developed text-to-image techniques can provide visually pleasing and diverse content as generation conditions, which greatly improves video quality and generation freedom. To capture the features of the temporal dimension, we expand the pretrained 2D convolution layers of the T2I model into our novel temporal-spatial motion learning layers and modify the attention blocks to the temporal level. Additionally, we develop an effective inference trick, shared-noise sampling, which improves the stability of videos and reduces computational costs. Our method can also be flexibly applied to other tasks, e.g., real-world image animation and video editing. Extensive experiments demonstrate that LAMP can effectively learn the motion pattern from limited data and generate high-quality videos. The code and models are available at https://rq-wu.github.io/projects/LAMP.
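
As a rough illustration of the shared-noise sampling idea mentioned in the abstract, the sketch below initializes every frame's latent from a single shared base noise lightly blended with per-frame noise, so the initial latents are correlated across time. The blending scheme, the `alpha` parameter, and the function name are illustrative assumptions for this sketch, not the authors' released implementation.

```python
import torch

def shared_noise_latents(num_frames, channels, height, width, alpha=0.2, generator=None):
    """Sketch of shared-noise sampling: all frames start from one shared base
    noise, perturbed per frame, which encourages temporally stable videos.

    `alpha` controls how much frame-specific noise is mixed in; this value and
    the linear blending are assumptions, not the paper's exact formulation.
    """
    base = torch.randn(1, channels, height, width, generator=generator)
    per_frame = torch.randn(num_frames, channels, height, width, generator=generator)
    # Blend shared and per-frame noise, then rescale each frame to roughly
    # unit standard deviation, as expected by a standard diffusion sampler.
    latents = (1 - alpha) * base + alpha * per_frame
    latents = latents / latents.std(dim=(1, 2, 3), keepdim=True)
    return latents

# Example: initial latents for a 16-frame video with SD-style 4-channel, 64x64 latents.
init_latents = shared_noise_latents(num_frames=16, channels=4, height=64, width=64)
print(init_latents.shape)  # torch.Size([16, 4, 64, 64])
```

Because every frame shares the same base noise, the denoising trajectories of neighboring frames start close together, which is one plausible reading of why the trick stabilizes the generated video.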