Sci-Fi: Symmetric Constraint for Frame Inbetweening

Abstract

Frame inbetweening aims to synthesize intermediate video sequences conditioned on given start and end frames. Current state-of-the-art methods mainly extend large-scale pre-trained Image-to-Video Diffusion Models (I2V-DMs) to incorporate the end-frame constraint, either by direct fine-tuning or without any additional training. We identify a critical limitation in these designs: they usually inject the end-frame constraint through the same mechanism that originally imposed the start-frame (single-image) constraint. However, because the original I2V-DMs have already been adequately trained on the start-frame condition, naively introducing the end-frame constraint through the same mechanism with much less (or even zero) specialized training is unlikely to give the end frame as strong an influence on the intermediate content as the start frame has. This asymmetry in the control strength of the two frames over the intermediate content likely leads to inconsistent motion or appearance collapse in the generated frames. To efficiently achieve symmetric constraints from the start and end frames, we propose a novel framework, termed Sci-Fi, which applies a stronger injection to the constraint that has received less training. Specifically, it handles the start-frame constraint as before, while introducing the end-frame constraint through an improved mechanism. The new mechanism is based on a carefully designed lightweight module, named EF-Net, which encodes only the end frame and expands it into temporally adaptive frame-wise features that are injected into the I2V-DM. This makes the end-frame constraint as strong as the start-frame constraint, enabling Sci-Fi to produce more harmonious transitions in various scenarios. Extensive experiments demonstrate the superiority of Sci-Fi over other baselines.
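
The abstract does not give EF-Net's architecture, so the following is only a toy sketch of the general idea it describes: encode the end frame alone, expand it into one feature map per frame, and add those features to the hidden states of the video model. The function names (`expand_end_frame`, `inject`), the 1x1 projection, the per-frame scale and shift, and the additive injection are all assumptions made for illustration, not the authors' design, and the weights are random rather than learned.

```python
import numpy as np


def expand_end_frame(end_latent, num_frames, rng):
    """Hypothetical stand-in for an end-frame encoder with temporal expansion.

    end_latent: array of shape (C, H, W), the latent of the end frame only.
    Returns an array of shape (num_frames, C, H, W) with one feature map per frame.
    """
    channels = end_latent.shape[0]
    # Assumed encoder: a single 1x1 channel projection (random weights here).
    proj = rng.standard_normal((channels, channels)) / np.sqrt(channels)
    encoded = np.einsum("oc,chw->ohw", proj, end_latent)
    # Assumed temporal adaptation: a per-frame, per-channel scale and shift.
    # In a trained model these would be learned; here they are random.
    scale = 0.1 * rng.standard_normal((num_frames, channels, 1, 1))
    shift = 0.1 * rng.standard_normal((num_frames, channels, 1, 1))
    return encoded[None] * (1.0 + scale) + shift


def inject(hidden, end_features, strength=1.0):
    """Assumed injection: add frame-wise end-frame features to hidden states."""
    return hidden + strength * end_features


rng = np.random.default_rng(0)
end_latent = rng.standard_normal((4, 8, 8))   # (C, H, W) end-frame latent
hidden = rng.standard_normal((16, 4, 8, 8))   # (T, C, H, W) video hidden states
features = expand_end_frame(end_latent, num_frames=16, rng=rng)
conditioned = inject(hidden, features)
print(conditioned.shape)  # (16, 4, 8, 8)
```

In this sketch each frame receives its own version of the end-frame features, which is one simple way to read "temporally adaptive frame-wise features"; setting `strength=0.0` recovers the unmodified hidden states.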