

Learning Action and Reasoning-Centric Image Editing from Videos and Simulations

July 3, 2024
Authors: Benno Krojer, Dheeraj Vattikonda, Luis Lara, Varun Jampani, Eva Portelance, Christopher Pal, Siva Reddy
cs.AI

Abstract

An image editing model should be able to perform diverse edits, ranging from object replacement and attribute or style changes to performing actions or movement, which require many forms of reasoning. Current general instruction-guided editing models have significant shortcomings with action and reasoning-centric edits. Object, attribute, or stylistic changes can be learned from visually static datasets. On the other hand, high-quality data for action and reasoning-centric edits is scarce and has to come from entirely different sources that cover, e.g., physical dynamics, temporality, and spatial reasoning. To this end, we meticulously curate the AURORA Dataset (Action-Reasoning-Object-Attribute), a collection of high-quality training data, human-annotated and curated from videos and simulation engines. We focus on a key aspect of quality training data: triplets (source image, prompt, target image) contain a single meaningful visual change described by the prompt, i.e., truly minimal changes between source and target images. To demonstrate the value of our dataset, we evaluate an AURORA-finetuned model on a new expert-curated benchmark (AURORA-Bench) covering 8 diverse editing tasks. Our model significantly outperforms previous editing models as judged by human raters. For automatic evaluations, we find important flaws in previous metrics and caution against their use for semantically hard editing tasks. Instead, we propose a new automatic metric that focuses on discriminative understanding. We hope that our efforts: (1) curating a quality training dataset and an evaluation benchmark, (2) developing critical evaluations, and (3) releasing a state-of-the-art model, will fuel further progress on general image editing.
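The abstract's central data format and evaluation idea can be made concrete with a short sketch. The EditTriplet dataclass below mirrors the (source image, prompt, target image) triplets described above, and discriminative_score illustrates the spirit of a discriminative metric: checking whether a scoring model prefers the true edit instruction over a distractor when paired with the edited image. The field names, the discriminative_score helper, and the use of CLIP as the scorer are illustrative assumptions, not the paper's actual schema or metric.

```python
from dataclasses import dataclass

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


@dataclass
class EditTriplet:
    """One AURORA-style training example: a single, minimal visual change.

    Field names are illustrative, not the dataset's actual schema.
    """
    source_image: Image.Image   # image before the edit
    prompt: str                 # instruction describing the one change
    target_image: Image.Image   # image after the edit


def discriminative_score(target: Image.Image, true_prompt: str,
                         distractor_prompt: str) -> bool:
    """Hedged sketch of a discriminative check: does a scoring model
    (CLIP here, as a stand-in) rank the true edit instruction above a
    distractor for the edited image? This illustrates the *idea* of
    discriminative evaluation, not the paper's proposed metric.
    """
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=[true_prompt, distractor_prompt],
                       images=target, return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image[0] holds the image's similarity to each prompt
        logits = model(**inputs).logits_per_image[0]
    return bool(logits[0] > logits[1])  # True if the true prompt wins
```

In practice, such a check would be run in batches over benchmark examples, reporting the fraction of cases where the true prompt scores higher than the distractor.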

