

STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing

June 28, 2025
Authors: Junsung Lee, Junoh Kang, Bohyung Han
cs.AI

Abstract

Previous text-guided video editing methods often suffer from temporal inconsistency, motion distortion, and, most notably, limited domain transformation. We attribute these limitations to insufficient modeling of spatiotemporal pixel relevance during the editing process. To address this, we propose STR-Match, a training-free video editing algorithm that produces visually appealing and spatiotemporally coherent videos through latent optimization guided by our novel STR score. The score captures spatiotemporal pixel relevance across adjacent frames by leveraging 2D spatial attention and 1D temporal modules in text-to-video (T2V) diffusion models, without the overhead of computationally expensive 3D attention mechanisms. Integrated into a latent optimization framework with a latent mask, STR-Match generates temporally consistent and visually faithful videos, maintaining strong performance even under significant domain transformations while preserving key visual attributes of the source. Extensive experiments demonstrate that STR-Match consistently outperforms existing methods in both visual quality and spatiotemporal consistency.
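As a rough sketch of the mechanism the abstract describes (not the authors' implementation), the Python fragment below shows one way a spatiotemporal relevance score could be assembled from a T2V model's 2D spatial and 1D temporal attention maps, and how such a score could guide a masked latent update. All tensor shapes, function names, and the MSE guidance loss are assumptions introduced for illustration.

```python
import torch
import torch.nn.functional as F

def str_score(spatial_attn: torch.Tensor, temporal_attn: torch.Tensor) -> torch.Tensor:
    """Combine per-frame 2D spatial attention with 1D temporal attention
    into a relevance score between adjacent frames.

    Shapes are illustrative assumptions, not the paper's notation:
      spatial_attn:  (F, HW, HW)  -- 2D spatial attention map per frame
      temporal_attn: (HW, F, F)   -- 1D temporal attention per spatial location
    Returns a (F-1, HW, HW) tensor of adjacent-frame relevance scores.
    """
    n_frames = spatial_attn.shape[0]
    scores = []
    for t in range(n_frames - 1):
        # Weight frame t's spatial attention by the temporal attention
        # linking each location in frame t to frame t+1, so the score
        # reflects spatiotemporal relevance without full 3D attention.
        link = temporal_attn[:, t, t + 1].unsqueeze(-1)   # (HW, 1)
        scores.append(spatial_attn[t] * link)             # (HW, HW)
    return torch.stack(scores)

def latent_optimization_step(latent, compute_edit_score, src_score, mask, lr=0.1):
    """One guidance step: pull the edited video's score toward the source
    score inside the masked (preserved) region. `compute_edit_score` is a
    hypothetical callable that must be differentiable in `latent`, e.g. a
    denoising pass through the T2V model that re-extracts attention maps."""
    latent = latent.detach().requires_grad_(True)
    edit_score = compute_edit_score(latent)
    loss = F.mse_loss(edit_score * mask, src_score * mask)
    loss.backward()
    return (latent - lr * latent.grad).detach()
```

In the actual method, an update of this kind would presumably be repeated across diffusion timesteps, with the latent mask confining edits to the target region while the score match preserves the source's motion and layout; the sketch above only fixes the shape of the computation.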