

Unified Video Editing with Temporal Reasoner

December 8, 2025
Authors: Xiangpeng Yang, Ji Xie, Yiyuan Yang, Yan Huang, Min Xu, Qiang Wu
cs.AI

Abstract

Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a "see, reason, then edit" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach. Our code, weights, and data are available at https://github.com/knightyxp/VideoCoF.
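To make the "see, reason, then edit" token ordering described above more concrete, here is a minimal sketch, assuming a transformer-based diffusion backbone that denoises source-video tokens, edit-region (reasoning) latents, and target-video latents in one joint sequence. All class, function, and variable names below are hypothetical illustrations and are not taken from the official VideoCoF repository; the actual model, RoPE alignment, and training objective differ in detail.

```python
# Minimal sketch (not the official implementation) of a "see, reason, then edit"
# token layout: edit-region (reasoning) latents are placed before the target-video
# latents, so the backbone commits to a region prediction ahead of the edit itself.
import torch
import torch.nn as nn


class ChainOfFramesDenoiser(nn.Module):
    """Toy denoiser that jointly processes source, reasoning, and target tokens."""

    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 4):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.to_latent = nn.Linear(dim, dim)

    def forward(self, src_tokens, noisy_reason_tokens, noisy_target_tokens):
        # "See":    clean source-video tokens serve as context.
        # "Reason": noisy edit-region latents come next in the sequence.
        # "Edit":   noisy target-video latents are denoised conditioned on both.
        seq = torch.cat([src_tokens, noisy_reason_tokens, noisy_target_tokens], dim=1)
        hidden = self.backbone(seq)
        n_src = src_tokens.shape[1]
        n_reason = noisy_reason_tokens.shape[1]
        reason_pred = self.to_latent(hidden[:, n_src:n_src + n_reason])
        target_pred = self.to_latent(hidden[:, n_src + n_reason:])
        return reason_pred, target_pred


# Usage with dummy latents: batch of 1, 16 tokens per segment, hidden size 512.
model = ChainOfFramesDenoiser()
src = torch.randn(1, 16, 512)      # encoded source-video tokens
reason = torch.randn(1, 16, 512)   # noisy edit-region (reasoning) latents
target = torch.randn(1, 16, 512)   # noisy target-video latents
reason_pred, target_pred = model(src, reason, target)
print(reason_pred.shape, target_pred.shape)  # torch.Size([1, 16, 512]) each
```

The key design point illustrated here is only the sequence ordering: by supervising the reasoning-latent slice separately from the target-video slice, the model is pushed to localize the edit region before synthesizing the edited frames, which is the mask-free localization mechanism the abstract describes.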