FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing

Abstract

First-Frame Propagation (FFP) offers a promising paradigm for controllable video editing, but existing methods are hampered by a reliance on cumbersome run-time guidance. We identify the root cause of this limitation as the inadequacy of current training datasets, which are often too short and too low in resolution, and which lack the task diversity required to teach robust temporal priors. To address this foundational data gap, we first introduce FFP-300K, a new large-scale dataset comprising 300K high-fidelity video pairs at 720p resolution and 81 frames in length, constructed via a principled two-track pipeline for diverse local and global edits. Building on this dataset, we propose a novel framework designed for true guidance-free FFP that resolves the critical tension between maintaining first-frame appearance and preserving source video motion. Architecturally, we introduce Adaptive Spatio-Temporal RoPE (AST-RoPE), which dynamically remaps positional encodings to disentangle appearance and motion references. At the objective level, we employ a self-distillation strategy in which an identity propagation task acts as a powerful regularizer, ensuring long-term temporal stability and preventing semantic drift. Comprehensive experiments on the EditVerseBench benchmark demonstrate that our method significantly outperforms existing academic and commercial models, improving PickScore by about 0.2 and VLM score by about 0.3 over these competitors.
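
The abstract does not specify how AST-RoPE computes its remapping, so the following is only a rough sketch of the general idea it builds on: standard rotary position encoding (RoPE) applied to position indices that have first been remapped. The function `remap_positions` and its `scale` and `offset` parameters are hypothetical, chosen here for illustration; they are not the paper's adaptive mechanism, which presumably derives the remapping from the input rather than from fixed constants.

```python
import math


def rope_rotate(vec, pos, base=10000.0):
    """Apply standard rotary position encoding to an even-length vector.

    Each consecutive pair of dimensions is rotated by an angle that is
    proportional to the (possibly fractional) position index `pos`.
    """
    d = len(vec)
    assert d % 2 == 0, "RoPE rotates dimensions in pairs"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # frequency for pair i // 2
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def remap_positions(positions, scale=1.0, offset=0.0):
    """Hypothetical affine remap of temporal indices (an assumption,
    not the paper's method): new_pos = scale * pos + offset."""
    return [scale * p + offset for p in positions]


def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Example: compress three frame indices before encoding a query vector.
frame_positions = remap_positions([0, 1, 2], scale=0.5)
query = [1.0, 0.0, 0.5, -0.5]
encoded = [rope_rotate(query, p) for p in frame_positions]
```

Because RoPE attention scores depend only on the difference between two position indices, changing the indices a token is assigned changes which other tokens it appears "close" to. That property is what makes position remapping a plausible lever for treating an appearance reference and a motion reference differently, although the exact scheme used by AST-RoPE is not described in the abstract.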
