ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation

Abstract

Recent progress in large generative models has significantly advanced image editing and in-context image generation, yet a critical gap remains in ensuring physical consistency, where edited objects must remain coherent. This capability is especially vital for world-simulation tasks. In this paper, we present ChronoEdit, a framework that reframes image editing as a video generation problem. First, ChronoEdit treats the input and edited images as the first and last frames of a video, allowing it to leverage large pretrained video generative models that capture not only object appearance but also the implicit physics of motion and interaction through learned temporal consistency. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs editing at inference time. Under this setting, the target frame is jointly denoised with reasoning tokens to imagine a plausible editing trajectory that constrains the solution space to physically viable transformations. The reasoning tokens are then dropped after a few steps to avoid the high computational cost of rendering a full video. To validate ChronoEdit, we introduce PBench-Edit, a new benchmark of image-prompt pairs for contexts that require physical consistency, and demonstrate that ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility. Code and models for both the 14B and 2B variants of ChronoEdit will be released on the project page: https://research.nvidia.com/labs/toronto-ai/chronoedit
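
The two-stage inference schedule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_denoiser`, `edit_with_temporal_reasoning`, and every parameter name and default value are hypothetical, frames are plain lists of floats instead of video latents, and the text prompt is omitted. The sketch only shows the control flow: the target frame is denoised jointly with intermediate reasoning frames for the first few steps, after which the reasoning frames are dropped and only the input and target frames are processed.

```python
import random
from typing import Callable, List, Tuple

Frame = List[float]  # toy stand-in for one latent frame


def toy_denoiser(frames: List[Frame], step: int, total_steps: int) -> List[Frame]:
    """Hypothetical denoiser that pulls every noisy frame toward the first frame.

    A real system would call a pretrained video diffusion model here.
    """
    reference = frames[0]
    blend = 1.0 / (total_steps - step)  # equals 1.0 on the final step
    updated = [reference]  # the input frame is conditioning and stays fixed
    for frame in frames[1:]:
        updated.append([x + blend * (r - x) for x, r in zip(frame, reference)])
    return updated


def edit_with_temporal_reasoning(
    input_frame: Frame,
    num_reasoning_frames: int = 3,  # assumed value, for illustration only
    total_steps: int = 10,          # assumed value
    reasoning_steps: int = 4,       # assumed value ("a few steps")
    denoiser: Callable[[List[Frame], int, int], List[Frame]] = toy_denoiser,
    seed: int = 0,
) -> Tuple[Frame, List[int]]:
    """Return the edited frame and the number of frames processed at each step."""
    rng = random.Random(seed)
    dim = len(input_frame)

    def noise() -> Frame:
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    reasoning = [noise() for _ in range(num_reasoning_frames)]
    target = noise()
    frames_per_step: List[int] = []

    for step in range(total_steps):
        if step < reasoning_steps:
            # Stage 1: denoise the target jointly with the reasoning frames.
            frames = denoiser([input_frame, *reasoning, target], step, total_steps)
            reasoning, target = frames[1:-1], frames[-1]
        else:
            # Stage 2: reasoning frames are dropped; only input and target remain.
            frames = denoiser([input_frame, target], step, total_steps)
            target = frames[-1]
        frames_per_step.append(len(frames))

    return target, frames_per_step


if __name__ == "__main__":
    edited, counts = edit_with_temporal_reasoning([0.5, -1.0, 2.0])
    print(counts)
```

With these assumed defaults, the first four steps process five frames each and the remaining six process two, which is where the cost saving relative to rendering a full video comes from in this toy setting.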
