Motion-Aware Concept Alignment for Consistent Video Editing
June 1, 2025
Authors: Tong Zhang, Juan C. Leon Alcazar, Bernard Ghanem
cs.AI
Abstract
We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a
training-free framework bridging the gap between image-domain semantic mixing
and video. Given a generated video and a user-provided reference image,
MoCA-Video injects the semantic features of the reference image into a specific
object within the video, while preserving the original motion and visual
context. Our approach leverages a diagonal denoising schedule and
class-agnostic segmentation to detect and track objects in the latent space and
precisely control the spatial location of the blended objects. To ensure
temporal coherence, we incorporate momentum-based semantic corrections and
gamma residual noise stabilization for smooth frame transitions. We evaluate
MoCA-Video's performance using standard metrics (SSIM, image-level LPIPS, and
temporal LPIPS) and introduce a novel metric, CASS (Conceptual Alignment Shift
Score), to
evaluate the consistency and effectiveness of the visual shifts between the
source prompt and the modified video frames. On a self-constructed dataset,
MoCA-Video outperforms current baselines, achieving superior spatial
consistency, coherent motion, and a significantly higher CASS score, despite
requiring no training or fine-tuning. MoCA-Video demonstrates that structured
manipulation of the diffusion noise trajectory enables controllable,
high-quality video synthesis.
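
The abstract names the core mechanisms (a diagonal denoising schedule, momentum-based semantic correction, and gamma residual noise stabilization) but gives no formulas. The PyTorch sketch below illustrates one plausible reading of each; every function name, argument, and update rule here is an assumption made for illustration, not MoCA-Video's published implementation.

```python
import torch

def diagonal_schedule(num_frames, num_steps):
    """Assumed diagonal schedule: frame f lags frame f-1 by one timestep,
    so earlier frames finish denoising first and can guide later ones."""
    for s in range(num_steps + num_frames - 1):
        for f in range(num_frames):
            t = s - f
            if 0 <= t < num_steps:
                yield f, t  # denoise frame f at timestep index t

def momentum_semantic_correction(latent, ref_feature, mask, velocity,
                                 beta=0.9, step_size=0.1):
    """Assumed momentum update: pull the masked object region of the latent
    toward the reference image's semantic feature, smoothing the pull
    across steps so frame-to-frame corrections stay coherent."""
    direction = (ref_feature - latent) * mask
    velocity = beta * velocity + (1.0 - beta) * direction
    return latent + step_size * velocity, velocity

def gamma_residual_stabilization(latent, prev_residual, gamma=0.5):
    """Assumed stabilizer: re-inject a gamma-scaled share of the previous
    frame's residual noise to damp temporal flicker."""
    return latent + gamma * prev_residual
```

In this reading, `mask` would come from the class-agnostic segmenter run in latent space, and `velocity` is carried across denoising steps so corrections accumulate smoothly rather than jumping per frame.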
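On the evaluation side, temporal LPIPS is commonly computed as the mean LPIPS distance between consecutive frames, shown below with the real `lpips` package API. CASS is introduced by this paper and its exact formula is not given in the abstract, so `cass_sketch` is only a guess: it scores how far frames shift from the source prompt toward the edited concept via CLIP similarities (real `transformers` API, assumed formulation).

```python
import torch
import lpips
from transformers import CLIPModel, CLIPProcessor

def temporal_lpips(frames, loss_fn=None):
    """frames: (T, 3, H, W) tensor scaled to [-1, 1].
    Mean perceptual distance between consecutive frames."""
    loss_fn = loss_fn or lpips.LPIPS(net="alex")
    with torch.no_grad():
        dists = [loss_fn(frames[t:t + 1], frames[t + 1:t + 2]).item()
                 for t in range(frames.shape[0] - 1)]
    return sum(dists) / len(dists)

def cass_sketch(frames_pil, source_prompt, edit_prompt):
    """Assumed CASS reading: mean CLIP-similarity gain of the edit
    prompt over the source prompt across the edited frames."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = proc(text=[source_prompt, edit_prompt], images=frames_pil,
                  return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image  # shape (T, 2)
    return (sims[:, 1] - sims[:, 0]).mean().item()
```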