STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing
June 28, 2025
Authors: Junsung Lee, Junoh Kang, Bohyung Han
cs.AI
Abstract
Previous text-guided video editing methods often suffer from temporal
inconsistency, motion distortion, and, most notably, limited domain
transformation. We attribute these limitations to insufficient modeling of
spatiotemporal pixel relevance during the editing process. To address this, we
propose STR-Match, a training-free video editing algorithm that produces
visually appealing and spatiotemporally coherent videos through latent
optimization guided by our novel STR score. The score captures spatiotemporal
pixel relevance across adjacent frames by leveraging 2D spatial attention and
1D temporal modules in text-to-video (T2V) diffusion models, without the
overhead of computationally expensive 3D attention mechanisms. Integrated into
a latent optimization framework with a latent mask, STR-Match generates
temporally consistent and visually faithful videos, maintaining strong
performance even under significant domain transformations while preserving key
visual attributes of the source. Extensive experiments demonstrate that
STR-Match consistently outperforms existing methods in both visual quality and
spatiotemporal consistency.
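
The abstract does not give the exact STR formulation, but the core idea, approximating frame-to-frame pixel relevance by composing a 2D spatial attention map with a 1D temporal attention map instead of computing full 3D attention, can be sketched as follows. This is a minimal illustration, not the paper's method: the tensor shapes, the composition rule, and the names `str_like_score`, `latent_step`, `score_fn`, and `mask` are all assumptions for the sketch.

```python
import torch

def str_like_score(spatial_attn: torch.Tensor,
                   temporal_attn: torch.Tensor) -> torch.Tensor:
    """Relevance of every pixel in frame t to every pixel in frame t+1.

    Assumed inputs (not the paper's definition):
      spatial_attn:  (F, N, N) per-frame 2D self-attention (pixel -> pixel)
      temporal_attn: (N, F, F) per-pixel 1D temporal attention (frame -> frame)
    Returns:         (F-1, N, N) adjacent-frame relevance maps

    Composing the cheap 2D and 1D maps avoids ever materializing the
    (F*N, F*N) matrix that full 3D attention would require.
    """
    F, N, _ = spatial_attn.shape
    scores = []
    for t in range(F - 1):
        # Per-pixel attention weight from frame t to frame t+1, shape (N,).
        hop = temporal_attn[:, t, t + 1]
        # Spatial mixing in frame t -> temporal hop -> spatial mixing in
        # frame t+1; scaling columns by `hop` equals A_s[t] @ diag(hop).
        scores.append((spatial_attn[t] * hop.unsqueeze(0)) @ spatial_attn[t + 1])
    return torch.stack(scores)

def latent_step(edit_latent: torch.Tensor, src_score: torch.Tensor,
                score_fn, mask: torch.Tensor, lr: float = 0.1) -> torch.Tensor:
    """One hypothetical latent-optimization step: pull the edited video's
    relevance pattern toward the source's, weighted by a latent mask over
    the region whose visual attributes should be preserved."""
    edit_latent = edit_latent.detach().requires_grad_(True)
    loss = (mask * (score_fn(edit_latent) - src_score) ** 2).mean()
    loss.backward()
    return (edit_latent - lr * edit_latent.grad).detach()
```

Under these assumed shapes, storing the factored maps costs O(F·N² + N·F²) memory versus O(F²·N²) for an explicit 3D attention map, which is the kind of saving the abstract alludes to when it avoids "computationally expensive 3D attention mechanisms."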