MultiCOIN: Multi-Modal COntrollable Video INbetweening
October 9, 2025
Authors: Maham Tanveer, Yang Zhou, Simon Niklaus, Ali Mahdavi Amiri, Hao Zhang, Krishna Kumar Singh, Nanxuan Zhao
cs.AI
Abstract
Video inbetweening creates smooth and natural transitions between two image
frames, making it an indispensable tool for video editing and long-form video
synthesis. Existing works in this domain are unable to generate large, complex,
or intricate motions. In particular, they cannot accommodate the variety of
user intents and generally lack fine control over the details of intermediate
frames, so the results often diverge from the user's creative intent. To fill these gaps, we
introduce MultiCOIN, a video inbetweening framework that allows multi-modal
controls, including depth transition and layering, motion trajectories, text
prompts, and target regions for movement localization, while achieving a
balance between flexibility, ease of use, and precision for fine-grained video
interpolation. To achieve this, we adopt the Diffusion Transformer (DiT)
architecture as our video generative model, due to its proven capability to
generate high-quality long videos. To ensure compatibility between DiT and our
multi-modal controls, we map all motion controls into a common sparse and
user-friendly point-based representation as the video/noise input. Further,
because the controls operate at different levels of granularity and influence,
we separate content controls and motion controls into two branches that encode
the required features before guiding the denoising process, resulting in two
generators, one for motion and the other for content. Finally,
we propose a stage-wise training strategy to ensure that our model learns the
multi-modal controls smoothly. Extensive qualitative and quantitative
experiments demonstrate that multi-modal controls enable a more dynamic,
customizable, and contextually accurate visual narrative.
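
The abstract does not say how motion controls are converted into the common sparse point-based representation, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a motion trajectory is given as one (x, y) position per frame and rasterizes it into a video-shaped grid in which almost every cell is empty. The function names, the linear interpolation between endpoints, and the integer track-id encoding are all hypothetical choices made for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


def interpolate_trajectory(start: Point, end: Point, num_frames: int) -> List[Point]:
    """Hypothetical helper: move one point linearly from start to end."""
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    points = []
    for t in range(num_frames):
        alpha = t / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        x = start[0] + alpha * (end[0] - start[0])
        y = start[1] + alpha * (end[1] - start[1])
        points.append((x, y))
    return points


def rasterize_sparse_points(
    trajectories: List[List[Point]], num_frames: int, height: int, width: int
) -> List[List[List[int]]]:
    """Build a [frames][rows][cols] grid: 0 means no control, k marks track k (1-based).

    This integer encoding is an assumption for illustration; a real system
    could store colors, depth values, or learned embeddings instead.
    """
    video = [[[0] * width for _ in range(height)] for _ in range(num_frames)]
    for track_id, track in enumerate(trajectories, start=1):
        if len(track) != num_frames:
            raise ValueError("each trajectory needs one point per frame")
        for t, (x, y) in enumerate(track):
            col, row = int(round(x)), int(round(y))
            # Points that leave the frame are simply dropped.
            if 0 <= row < height and 0 <= col < width:
                video[t][row][col] = track_id
    return video


if __name__ == "__main__":
    track = interpolate_trajectory((1.0, 1.0), (5.0, 3.0), num_frames=5)
    control = rasterize_sparse_points([track], num_frames=5, height=8, width=8)
    for t, frame in enumerate(control):
        marked = [(r, c) for r in range(8) for c in range(8) if frame[r][c]]
        print(f"frame {t}: control points at (row, col) {marked}")
```

A grid like this has the same frame-by-frame layout as the video itself, which is one plausible reason a point-based encoding would be convenient as an extra input to a video diffusion model; how MultiCOIN actually encodes and injects the points is described in the paper, not here.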