
MultiCOIN: Multi-Modal COntrollable Video INbetweening

October 9, 2025
作者: Maham Tanveer, Yang Zhou, Simon Niklaus, Ali Mahdavi Amiri, Hao Zhang, Krishna Kumar Singh, Nanxuan Zhao
cs.AI

Abstract

Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to misalignment with the user's creative intent. To fill these gaps, we introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To this end, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls, which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches that encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.
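
The shared point-based motion representation described in the abstract can be pictured with a small, hypothetical sketch (not the authors' released code): sparse user-drawn trajectories are rasterized into a per-frame control volume that a DiT-based inbetweener could consume alongside its video/noise input. The function name, tensor layout, and Gaussian splatting radius below are illustrative assumptions.

import numpy as np

def rasterize_trajectories(trajectories, num_frames, height, width, radius=2.0):
    """Splat sparse (x, y) point trajectories into a (T, H, W) control volume.

    trajectories: iterable of arrays of shape (num_frames, 2), one tracked or
    user-drawn point per trajectory, giving its (x, y) position in each frame.
    """
    volume = np.zeros((num_frames, height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]  # pixel coordinate grids
    for traj in trajectories:
        for t in range(num_frames):
            x, y = traj[t]
            # Soft Gaussian splat so each control point covers a small region
            # rather than a single pixel.
            blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * radius ** 2))
            volume[t] = np.maximum(volume[t], blob)
    return volume

# Example: one point moving diagonally across 16 frames of a 64x64 canvas.
T, H, W = 16, 64, 64
traj = np.stack([np.linspace(8, 56, T), np.linspace(8, 40, T)], axis=1)
control = rasterize_trajectories([traj], T, H, W)
print(control.shape)  # (16, 64, 64)

In the framework described above, a control volume of this kind would feed the motion branch, while content controls such as text prompts condition the separate content branch before both guide the denoising process.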