Unified Video Editing with Temporal Reasoner

December 8, 2025
Authors: Xiangpeng Yang, Ji Xie, Yiyuan Yang, Yan Huang, Min Xu, Qiang Wu
cs.AI

Abstract

Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors such as masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a "see, reason, then edit" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach. Our code, weights, and data are available at https://github.com/knightyxp/VideoCoF.
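The core mechanism, predicting edit-region reasoning tokens before the target video tokens within a single diffusion sequence, can be pictured as a token-layout problem. Below is a minimal PyTorch sketch of that layout under assumed shapes; every name here (source_tokens, reason_tokens, target_tokens, frame_pos) is hypothetical and illustrative, not taken from the released VideoCoF code:

```python
import torch

# Illustrative dimensions (assumptions, not the paper's actual config):
# batch, frames, tokens per frame, channel width.
B, F, N, D = 1, 16, 256, 64

source_tokens = torch.randn(B, F * N, D)  # "see":   source video latents
reason_tokens = torch.randn(B, F * N, D)  # "reason": edit-region latents
target_tokens = torch.randn(B, F * N, D)  # "edit":  target video latents

# The diffusion transformer attends over one concatenated sequence, so
# the reasoning segment is produced before, and conditions, the target
# segment -- the "see, reason, then edit" ordering.
sequence = torch.cat([source_tokens, reason_tokens, target_tokens], dim=1)

# RoPE alignment (assumed form): reuse the source frames' temporal
# positions for the reasoning and target segments, so frame index t in
# each segment carries the same rotary position. Shared per-frame
# positions are what keep motion aligned across segments and allow
# extending F past the training length.
frame_pos = torch.arange(F).repeat_interleave(N)          # per-token frame index
positions = torch.cat([frame_pos, frame_pos, frame_pos])  # shared across segments
assert positions.shape[0] == sequence.shape[1]
```

The design point mirrored here is that the reasoning tokens do not get fresh positions appended after the video: they share the source frames' temporal coordinates, which is consistent with the abstract's claim that the RoPE strategy enables both motion alignment and length extrapolation.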