STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing
June 28, 2025
Authors: Junsung Lee, Junoh Kang, Bohyung Han
cs.AI
Abstract
Previous text-guided video editing methods often suffer from temporal
inconsistency, motion distortion, and, most notably, limited domain
transformation. We attribute these limitations to insufficient modeling of
spatiotemporal pixel relevance during the editing process. To address this, we
propose STR-Match, a training-free video editing algorithm that produces
visually appealing and spatiotemporally coherent videos through latent
optimization guided by our novel STR score. The score captures spatiotemporal
pixel relevance across adjacent frames by leveraging 2D spatial attention and
1D temporal modules in text-to-video (T2V) diffusion models, without the
overhead of computationally expensive 3D attention mechanisms. Integrated into
a latent optimization framework with a latent mask, STR-Match generates
temporally consistent and visually faithful videos, maintaining strong
performance even under significant domain transformations while preserving key
visual attributes of the source. Extensive experiments demonstrate that
STR-Match consistently outperforms existing methods in both visual quality and
spatiotemporal consistency.
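
The abstract does not give the exact STR formulation, but the core idea, approximating frame-to-frame pixel relevance by composing a 2D spatial attention map with a 1D temporal attention map instead of computing full 3D attention, can be sketched as follows. This is a minimal illustration, not the paper's method: the tensor shapes, the composition rule, and the names `str_like_score`, `latent_step`, `score_fn`, and `mask` are all assumptions for the sketch.

```python
import torch

def str_like_score(spatial_attn: torch.Tensor,
                   temporal_attn: torch.Tensor) -> torch.Tensor:
    """Relevance of every pixel in frame t to every pixel in frame t+1.

    Assumed inputs (not the paper's definition):
      spatial_attn:  (F, N, N) per-frame 2D self-attention (pixel -> pixel)
      temporal_attn: (N, F, F) per-pixel 1D temporal attention (frame -> frame)
    Returns:         (F-1, N, N) adjacent-frame relevance maps

    Composing the cheap 2D and 1D maps avoids ever materializing the
    (F*N, F*N) matrix that full 3D attention would require.
    """
    F, N, _ = spatial_attn.shape
    scores = []
    for t in range(F - 1):
        # Per-pixel attention weight from frame t to frame t+1, shape (N,).
        hop = temporal_attn[:, t, t + 1]
        # Spatial mixing in frame t -> temporal hop -> spatial mixing in
        # frame t+1; scaling columns by `hop` equals A_s[t] @ diag(hop).
        scores.append((spatial_attn[t] * hop.unsqueeze(0)) @ spatial_attn[t + 1])
    return torch.stack(scores)

def latent_step(edit_latent: torch.Tensor, src_score: torch.Tensor,
                score_fn, mask: torch.Tensor, lr: float = 0.1) -> torch.Tensor:
    """One hypothetical latent-optimization step: pull the edited video's
    relevance pattern toward the source's, weighted by a latent mask over
    the region whose visual attributes should be preserved."""
    edit_latent = edit_latent.detach().requires_grad_(True)
    loss = (mask * (score_fn(edit_latent) - src_score) ** 2).mean()
    loss.backward()
    return (edit_latent - lr * edit_latent.grad).detach()
```

Under these assumed shapes, storing the factored maps costs O(F·N² + N·F²) memory versus O(F²·N²) for an explicit 3D attention map, which is the kind of saving the abstract alludes to when it avoids "computationally expensive 3D attention mechanisms."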