Motion Control for Enhanced Complex Action Video Generation
November 13, 2024
Authors: Qiang Zhou, Shaofeng Zhang, Nianzu Yang, Ye Qian, Hao Li
cs.AI
Abstract
Existing text-to-video (T2V) models often struggle with generating videos
with sufficiently pronounced or complex actions. A key limitation lies in the
text prompt's inability to precisely convey intricate motion details. To
address this, we propose a novel framework, MVideo, designed to produce
long-duration videos with precise, fluid actions. MVideo overcomes the
limitations of text prompts by incorporating mask sequences as an additional
motion condition input, providing a clearer, more accurate representation of
intended actions. Leveraging foundational vision models such as GroundingDINO
and SAM2, MVideo automatically generates mask sequences, enhancing both
efficiency and robustness. Our results demonstrate that, after training, MVideo
effectively aligns text prompts with motion conditions to produce videos that
simultaneously meet both criteria. This dual control mechanism allows for more
dynamic video generation by enabling alterations to either the text prompt or
motion condition independently, or both in tandem. Furthermore, MVideo supports
motion condition editing and composition, facilitating the generation of videos
with more complex actions. MVideo thus advances T2V motion generation, setting
a strong benchmark for improved action depiction in current video diffusion
models. Our project page is available at https://mvideo-v1.github.io/.
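
The abstract states that MVideo obtains its mask-sequence motion conditions automatically with GroundingDINO and SAM2. The sketch below is a rough illustration of how such a pipeline could be assembled from those libraries' public inference utilities; it is not MVideo's released code, the file paths, thresholds, and example caption are placeholders, and exact function signatures may differ between releases.

```python
# Sketch: automatic mask-sequence extraction for mask-based motion conditioning.
# Assumes the public GroundingDINO and SAM2 repositories are installed and that
# the video has been decoded into a directory of JPEG frames.
import numpy as np
import torch
from groundingdino.util.inference import load_model, load_image, predict
from sam2.build_sam import build_sam2_video_predictor

# 1) Ground the moving object in the first frame from a text phrase.
dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("frames/00000.jpg")  # first video frame
boxes, logits, phrases = predict(
    model=dino, image=image, caption="a running dog",
    box_threshold=0.35, text_threshold=0.25,
)

# GroundingDINO returns normalized (cx, cy, w, h); SAM2 expects pixel xyxy.
# boxes[0] assumes at least one detection passed the thresholds.
h, w, _ = image_source.shape
cx, cy, bw, bh = boxes[0].tolist()
box_xyxy = np.array([(cx - bw / 2) * w, (cy - bh / 2) * h,
                     (cx + bw / 2) * w, (cy + bh / 2) * h])

# 2) Propagate the object's mask through the clip with SAM2's video predictor.
sam2 = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")
state = sam2.init_state(video_path="frames/")
sam2.add_new_points_or_box(inference_state=state, frame_idx=0, obj_id=1,
                           box=box_xyxy)

mask_sequence = []  # one binary mask per frame: the motion condition input
with torch.inference_mode():
    for frame_idx, obj_ids, mask_logits in sam2.propagate_in_video(state):
        mask_sequence.append((mask_logits[0] > 0.0).cpu().numpy())
```

The resulting `mask_sequence` plays the role of the additional motion condition that, per the abstract, is fed to the video diffusion model alongside the text prompt.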
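
The abstract also mentions editing and composing motion conditions to obtain more complex actions, but does not specify the operators. The following minimal sketch shows two plausible operations on mask sequences; the trajectory-shift edit and the per-frame union are assumptions for illustration only.

```python
# Sketch: hypothetical editing and composition of mask-sequence motion conditions.
import numpy as np

def shift_mask_sequence(seq, dx_per_frame, dy_per_frame):
    """Edit a motion condition by translating each frame's mask a little
    further than the last, giving the same object a new trajectory."""
    edited = []
    for t, mask in enumerate(seq):
        shifted = np.roll(mask, shift=(t * dy_per_frame, t * dx_per_frame),
                          axis=(0, 1))
        edited.append(shifted)
    return edited

def compose_mask_sequences(seq_a, seq_b):
    """Compose two motion conditions by per-frame union, so one video
    can depict both actions simultaneously."""
    assert len(seq_a) == len(seq_b), "sequences must cover the same frames"
    return [np.logical_or(a, b) for a, b in zip(seq_a, seq_b)]
```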