TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models

March 25, 2024
作者: Zhongwei Zhang, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Ting Yao, Yang Cao, Tao Mei
cs.AI

Abstract

Recent advances in text-to-video generation have demonstrated the utility of powerful diffusion models. Nevertheless, the problem is not trivial when shaping diffusion models to animate a static image (i.e., image-to-video generation). The difficulty originates from the fact that the diffusion process of subsequent animated frames should not only preserve faithful alignment with the given image but also pursue temporal coherence among adjacent frames. To alleviate this, we present TRIP, a new recipe for the image-to-video diffusion paradigm that pivots on an image noise prior derived from the static image to jointly trigger inter-frame relational reasoning and ease coherent temporal modeling via temporal residual learning. Technically, the image noise prior is first attained through a one-step backward diffusion process based on both the static image and the noised video latent codes. Next, TRIP executes a residual-like dual-path scheme for noise prediction: 1) a shortcut path that directly takes the image noise prior as the reference noise of each frame to amplify the alignment between the first frame and subsequent frames; 2) a residual path that employs a 3D-UNet over the noised video and static image latent codes to enable inter-frame relational reasoning, thereby easing the learning of the residual noise for each frame. Furthermore, the reference and residual noise of each frame are dynamically merged via an attention mechanism for final video generation. Extensive experiments on the WebVid-10M, DTDB, and MSR-VTT datasets demonstrate the effectiveness of our TRIP for image-to-video generation. Please see our project page at https://trip-i2v.github.io/TRIP/.
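
To make the dual-path scheme concrete, here is a minimal PyTorch sketch based only on the abstract above. All names (`image_noise_prior`, `TinyUNet3D`, `DualPathDenoiser`) are hypothetical illustrations, not the authors' implementation. It assumes the standard forward diffusion z_t = sqrt(ᾱ_t)·z_0 + sqrt(1−ᾱ_t)·ε, so the one-step backward process solves for ε after approximating each frame's clean latent with the static image latent; the paper's attention-based fusion of reference and residual noise is replaced by a single learned gate purely to keep the sketch short.

```python
import torch
import torch.nn as nn


def image_noise_prior(z_t, z_image, alpha_bar_t):
    """One-step backward diffusion (per the abstract): approximate every
    frame's clean latent with the static image latent, then invert the
    forward process z_t = sqrt(a)*z_0 + sqrt(1-a)*eps for eps.

    z_t:         (B, F, C, H, W) noised video latent codes
    z_image:     (B, C, H, W)    static image latent code
    alpha_bar_t: (B,)            cumulative noise schedule at step t, in (0, 1)
    """
    a = alpha_bar_t.view(-1, 1, 1, 1, 1)
    z0 = z_image.unsqueeze(1)  # broadcast the image latent over all frames
    return (z_t - a.sqrt() * z0) / (1.0 - a).sqrt()


class TinyUNet3D(nn.Module):
    """Toy stand-in for the residual path's 3D-UNet (the real backbone is a
    full video diffusion model; two Conv3d layers suffice for the sketch)."""

    def __init__(self, channels=4, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(2 * channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, z_t, z_image):
        # Condition on the static image by concatenating its latent to each frame.
        n_frames = z_t.shape[1]
        cond = z_image.unsqueeze(1).expand(-1, n_frames, -1, -1, -1)
        x = torch.cat([z_t, cond], dim=2)   # (B, F, 2C, H, W)
        x = x.transpose(1, 2)               # Conv3d expects (B, C, F, H, W)
        return self.net(x).transpose(1, 2)  # back to (B, F, C, H, W)


class DualPathDenoiser(nn.Module):
    """Residual-like dual-path noise prediction with a simplified fusion."""

    def __init__(self, channels=4):
        super().__init__()
        self.residual_path = TinyUNet3D(channels)
        # The paper merges the two noises with an attention mechanism; a
        # single learned scalar gate stands in for it here.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, z_t, z_image, alpha_bar_t):
        eps_ref = image_noise_prior(z_t, z_image, alpha_bar_t)  # shortcut path
        eps_res = self.residual_path(z_t, z_image)              # residual path
        lam = torch.sigmoid(self.gate)
        return lam * eps_ref + (1.0 - lam) * eps_res


# Toy usage: 2 clips, 8 frames, 4-channel 32x32 latents.
if __name__ == "__main__":
    z_t = torch.randn(2, 8, 4, 32, 32)
    z_image = torch.randn(2, 4, 32, 32)
    alpha_bar_t = torch.full((2,), 0.5)
    eps_hat = DualPathDenoiser()(z_t, z_image, alpha_bar_t)
    print(eps_hat.shape)  # torch.Size([2, 8, 4, 32, 32])
```

The split mirrors the abstract's intent: the shortcut path anchors every frame to the first frame's content via the shared noise prior, so the learned network only has to model the per-frame residual, which is an easier target than predicting each frame's full noise from scratch.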
