
ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning

December 10, 2025
Authors: Xinyu Liu, Hangjie Yuan, Yujie Wei, Jiazheng Xing, Yujin Han, Jiahao Pan, Yanbiao Ma, Chi-Min Chan, Kang Zhao, Shiwei Zhang, Wenhan Luo, Yike Guo
cs.AI

Abstract

Video unified models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing even when equipped with powerful internal vision-language models (VLMs). We attribute this gap to two factors: 1) existing datasets are inadequate for training and evaluating reasoning-aware video editing, and 2) there is an inherent disconnect between the models' reasoning and editing capabilities, which prevents their rich understanding from effectively guiding the editing process. Bridging this gap requires an integrated framework that connects reasoning with visual transformation. To this end, we introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. To support systematic evaluation, we construct RVE-Bench, a comprehensive benchmark with two complementary subsets: Reasoning-Informed Video Editing and In-Context Video Generation. These subsets cover diverse reasoning dimensions and real-world editing scenarios. Building upon this foundation, we propose ReViSE, a Self-Reflective Reasoning (SRF) framework that unifies generation and evaluation within a single architecture. The model's internal VLM provides intrinsic feedback by assessing whether the edited video logically satisfies the given instruction. This differential feedback refines the generator's reasoning behavior during training. Extensive experiments on RVE-Bench demonstrate that ReViSE significantly enhances editing accuracy and visual fidelity, achieving a 32% improvement in the Overall score on the reasoning-informed video editing subset over state-of-the-art methods.