
Pix2Gif: Motion-Guided Diffusion for GIF Generation

March 7, 2024
Authors: Hitesh Kandala, Jianfeng Gao, Jianwei Yang
cs.AI

Abstract

We present Pix2Gif, a motion-guided diffusion model for image-to-GIF (video) generation. We tackle this problem differently by formulating the task as an image translation problem steered by text and motion magnitude prompts, as shown in the teaser figure. To ensure that the model adheres to motion guidance, we propose a new motion-guided warping module that spatially transforms the features of the source image conditioned on the two types of prompts. Furthermore, we introduce a perceptual loss to ensure the transformed feature map remains within the same space as the target image, enforcing content consistency and coherence. In preparation for model training, we meticulously curated data by extracting coherent image frames from the TGIF video-caption dataset, which provides rich information about the temporal changes of subjects. After pretraining, we apply our model in a zero-shot manner to a number of video datasets. Extensive qualitative and quantitative experiments demonstrate the effectiveness of our model: it captures not only the semantic prompt from text but also the spatial one from motion guidance. We train all our models using a single node of 16xV100 GPUs. Code, dataset, and models are made public at: https://hiteshk03.github.io/Pix2Gif/.
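The abstract describes two technical components: a warping module that spatially transforms source-image features conditioned on text and motion-magnitude prompts, and a perceptual loss keeping the warped features in the target frame's feature space. Below is a minimal PyTorch sketch of that idea, assuming a flow-field-based warp; the class name `MotionGuidedWarping`, the layer sizes, the embedding fusion, and the L1 feature loss are all illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionGuidedWarping(nn.Module):
    # Hypothetical sketch: predict a dense flow field from fused text and
    # motion-magnitude embeddings, then use it to warp source features.
    def __init__(self, feat_channels=64, cond_dim=128):
        super().__init__()
        # Fuse the two prompt embeddings into one conditioning vector.
        self.cond_proj = nn.Linear(cond_dim * 2, cond_dim)
        # Map [source features ++ broadcast condition] to a 2-channel
        # flow field (dx, dy) in normalized [-1, 1] coordinates.
        self.flow_head = nn.Sequential(
            nn.Conv2d(feat_channels + cond_dim, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 2, 3, padding=1),
        )

    def forward(self, src_feat, text_emb, motion_emb):
        b, _, h, w = src_feat.shape
        cond = self.cond_proj(torch.cat([text_emb, motion_emb], dim=-1))
        cond_map = cond[:, :, None, None].expand(-1, -1, h, w)
        flow = self.flow_head(torch.cat([src_feat, cond_map], dim=1))
        # Base sampling grid in [-1, 1], offset by the predicted flow.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=src_feat.device),
            torch.linspace(-1, 1, w, device=src_feat.device),
            indexing="ij",
        )
        base = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        grid = base + flow.permute(0, 2, 3, 1)
        # Bilinear warp of the source features toward the target frame.
        return F.grid_sample(src_feat, grid, align_corners=True)


def feature_consistency_loss(warped_feat, target_feat):
    # Stand-in for the paper's perceptual loss: penalize warped features
    # that drift out of the target frame's feature space.
    return F.l1_loss(warped_feat, target_feat)


# Toy usage with random tensors.
warp = MotionGuidedWarping()
src = torch.randn(2, 64, 32, 32)    # source-image features
text = torch.randn(2, 128)          # text-prompt embedding
motion = torch.randn(2, 128)        # motion-magnitude embedding
out = warp(src, text, motion)       # (2, 64, 32, 32)
loss = feature_consistency_loss(out, torch.randn_like(out))
```

The flow-based formulation is one plausible reading of "spatially transform the features"; the key property it captures is that the motion-magnitude prompt directly modulates how far features move, while the perceptual-style loss anchors the result to the target frame.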