Learning Action and Reasoning-Centric Image Editing from Videos and Simulations

July 3, 2024
Authors: Benno Krojer, Dheeraj Vattikonda, Luis Lara, Varun Jampani, Eva Portelance, Christopher Pal, Siva Reddy
cs.AI

Abstract

An image editing model should be able to perform diverse edits, ranging from object replacement and changes to attributes or style, to performing actions or movement, which require many forms of reasoning. Current general instruction-guided editing models have significant shortcomings with action- and reasoning-centric edits. Object, attribute, or stylistic changes can be learned from visually static datasets. On the other hand, high-quality data for action- and reasoning-centric edits is scarce and has to come from entirely different sources covering, e.g., physical dynamics, temporality, and spatial reasoning. To this end, we meticulously curate the AURORA Dataset (Action-Reasoning-Object-Attribute), a collection of high-quality training data, human-annotated and curated from videos and simulation engines. We focus on a key aspect of quality training data: triplets (source image, prompt, target image) contain a single meaningful visual change described by the prompt, i.e., truly minimal changes between source and target images. To demonstrate the value of our dataset, we evaluate an AURORA-finetuned model on a new expert-curated benchmark (AURORA-Bench) covering 8 diverse editing tasks. Our model significantly outperforms previous editing models as judged by human raters. For automatic evaluations, we find important flaws in previous metrics and caution against their use for semantically hard editing tasks. Instead, we propose a new automatic metric that focuses on discriminative understanding. We hope that our efforts, (1) curating a quality training dataset and an evaluation benchmark, (2) developing critical evaluations, and (3) releasing a state-of-the-art model, will fuel further progress on general image editing.
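To make the triplet format concrete, here is a minimal sketch in Python of how one such (source image, prompt, target image) example could be represented and loaded. The `EditTriplet` class, its field names, and the `load_triplet` helper are illustrative assumptions for this page, not the dataset's released schema.

```python
from dataclasses import dataclass

from PIL import Image


@dataclass
class EditTriplet:
    """One AURORA-style training example: a single meaningful,
    minimal visual change between source and target, described
    by the prompt. Field names are illustrative, not the
    dataset's actual schema."""
    source_image: Image.Image  # image before the edit
    prompt: str                # e.g. "move the cup to the left"
    target_image: Image.Image  # image after the edit


def load_triplet(src_path: str, prompt: str, tgt_path: str) -> EditTriplet:
    # Hypothetical loader: paths and prompts would come from the
    # dataset's annotation files.
    return EditTriplet(
        source_image=Image.open(src_path).convert("RGB"),
        prompt=prompt,
        target_image=Image.open(tgt_path).convert("RGB"),
    )
```

The key property the abstract emphasizes is that the prompt accounts for the *only* meaningful difference between the two images, which is what makes such triplets suitable supervision for action- and reasoning-centric edits.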
