ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation

October 5, 2025
Authors: Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Tianshi Cao, Kai He, Yifan Lu, Ruiyuan Gao, Enze Xie, Shiyi Lan, Jose M. Alvarez, Jun Gao, Sanja Fidler, Zian Wang, Huan Ling
cs.AI

Abstract

Large generative models have recently made significant progress in image editing and in-context image generation, yet a critical gap remains in ensuring physical consistency: edited objects must remain coherent. This capability is especially vital for world-simulation tasks. In this paper, we present ChronoEdit, a framework that reframes image editing as a video generation problem. First, ChronoEdit treats the input and edited images as the first and last frames of a video, allowing it to leverage large pretrained video generative models that capture not only object appearance but also the implicit physics of motion and interaction through learned temporal consistency. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs editing at inference time. Under this setting, the target frame is jointly denoised with reasoning tokens to imagine a plausible editing trajectory that constrains the solution space to physically viable transformations. The reasoning tokens are then dropped after a few steps to avoid the high computational cost of rendering a full video. To validate ChronoEdit, we introduce PBench-Edit, a new benchmark of image-prompt pairs for contexts that require physical consistency, and demonstrate that ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility. Code and models for both the 14B and 2B variants of ChronoEdit will be released on the project page: https://research.nvidia.com/labs/toronto-ai/chronoedit
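
The two-stage inference schedule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_denoiser`, `edit_with_temporal_reasoning`, the latent shapes, and all parameter values are hypothetical stand-ins chosen only to show the control flow, in which intermediate "reasoning" frames are denoised jointly with the target frame for the first few steps and then dropped so that the remaining steps process only the input and target frames.

```python
import numpy as np


def toy_denoiser(latents, step):
    """Hypothetical stand-in for a pretrained video diffusion model.

    latents has shape (frames, dim). Each frame is nudged toward the
    temporal mean, a crude imitation of temporal coupling between frames.
    """
    mean = latents.mean(axis=0, keepdims=True)
    return latents + 0.1 * (mean - latents)


def edit_with_temporal_reasoning(input_latent, num_reasoning_frames=3,
                                 total_steps=10, reasoning_steps=4, seed=0):
    """Toy two-stage schedule (all names and defaults are assumptions).

    Stage 1 (step < reasoning_steps): denoise [input, reasoning..., target].
    Stage 2 (remaining steps): drop reasoning frames, denoise [input, target].
    Returns the final target latent and the frame count used at each step.
    """
    rng = np.random.default_rng(seed)
    dim = input_latent.shape[0]
    # Reasoning frames and the target frame both start from noise.
    reasoning = rng.standard_normal((num_reasoning_frames, dim))
    target = rng.standard_normal((1, dim))
    frames_per_step = []

    for step in range(total_steps):
        use_reasoning = step < reasoning_steps
        if use_reasoning:
            seq = np.concatenate([input_latent[None], reasoning, target])
        else:
            seq = np.concatenate([input_latent[None], target])
        frames_per_step.append(seq.shape[0])

        out = toy_denoiser(seq, step)
        # The input frame is rebuilt from input_latent each step, so it stays fixed.
        if use_reasoning:
            reasoning = out[1:-1]
        target = out[-1:]

    return target[0], frames_per_step
```

In this sketch the per-step sequence length falls from five frames to two once the reasoning frames are discarded, which is the source of the compute saving the abstract mentions; a real system would replace `toy_denoiser` with a video diffusion model and a proper noise schedule.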