Motion-Aware Concept Alignment for Consistent Video Editing
June 1, 2025
Authors: Tong Zhang, Juan C. Leon Alcazar, Bernard Ghanem
cs.AI
Abstract
We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a
training-free framework bridging the gap between image-domain semantic mixing
and video. Given a generated video and a user-provided reference image,
MoCA-Video injects the semantic features of the reference image into a specific
object within the video, while preserving the original motion and visual
context. Our approach leverages a diagonal denoising schedule and
class-agnostic segmentation to detect and track objects in the latent space and
precisely control the spatial location of the blended objects. To ensure
temporal coherence, we incorporate momentum-based semantic corrections and
gamma residual noise stabilization for smooth frame transitions. We evaluate
MoCA-Video's performance using standard metrics (SSIM, image-level LPIPS, and
temporal LPIPS) and introduce a novel metric, CASS (Conceptual Alignment Shift
Score), to
evaluate the consistency and effectiveness of the visual shifts between the
source prompt and the modified video frames. On a self-constructed dataset,
MoCA-Video outperforms current baselines, achieving superior spatial
consistency, coherent motion, and a significantly higher CASS score, despite
requiring no training or fine-tuning. MoCA-Video demonstrates that structured
manipulation of the diffusion noise trajectory enables controllable,
high-quality video synthesis.
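
The abstract names the core mechanisms (a diagonal denoising schedule, momentum-based semantic correction, and gamma residual noise stabilization) but gives no formulas. The PyTorch sketch below illustrates one plausible reading of each; every function name, argument, and update rule here is an assumption made for illustration, not MoCA-Video's published implementation.

```python
import torch

def diagonal_schedule(num_frames, num_steps):
    """Assumed diagonal schedule: frame f lags frame f-1 by one timestep,
    so earlier frames finish denoising first and can guide later ones."""
    for s in range(num_steps + num_frames - 1):
        for f in range(num_frames):
            t = s - f
            if 0 <= t < num_steps:
                yield f, t  # denoise frame f at timestep index t

def momentum_semantic_correction(latent, ref_feature, mask, velocity,
                                 beta=0.9, step_size=0.1):
    """Assumed momentum update: pull the masked object region of the latent
    toward the reference image's semantic feature, smoothing the pull
    across steps so frame-to-frame corrections stay coherent."""
    direction = (ref_feature - latent) * mask
    velocity = beta * velocity + (1.0 - beta) * direction
    return latent + step_size * velocity, velocity

def gamma_residual_stabilization(latent, prev_residual, gamma=0.5):
    """Assumed stabilizer: re-inject a gamma-scaled share of the previous
    frame's residual noise to damp temporal flicker."""
    return latent + gamma * prev_residual
```

In this reading, `mask` would come from the class-agnostic segmenter run in latent space, and `velocity` is carried across denoising steps so corrections accumulate smoothly rather than jumping per frame.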
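On the evaluation side, temporal LPIPS is commonly computed as the mean LPIPS distance between consecutive frames, shown below with the real `lpips` package API. CASS is introduced by this paper and its exact formula is not given in the abstract, so `cass_sketch` is only a guess: it scores how far frames shift from the source prompt toward the edited concept via CLIP similarities (real `transformers` API, assumed formulation).

```python
import torch
import lpips
from transformers import CLIPModel, CLIPProcessor

def temporal_lpips(frames, loss_fn=None):
    """frames: (T, 3, H, W) tensor scaled to [-1, 1].
    Mean perceptual distance between consecutive frames."""
    loss_fn = loss_fn or lpips.LPIPS(net="alex")
    with torch.no_grad():
        dists = [loss_fn(frames[t:t + 1], frames[t + 1:t + 2]).item()
                 for t in range(frames.shape[0] - 1)]
    return sum(dists) / len(dists)

def cass_sketch(frames_pil, source_prompt, edit_prompt):
    """Assumed CASS reading: mean CLIP-similarity gain of the edit
    prompt over the source prompt across the edited frames."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = proc(text=[source_prompt, edit_prompt], images=frames_pil,
                  return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image  # shape (T, 2)
    return (sims[:, 1] - sims[:, 0]).mean().item()
```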