Improving Editability in Image Generation with Layer-wise Memory
May 2, 2025
Authors: Daneul Kim, Jaeah Lee, Jaesik Park
cs.AI
Abstract
Most real-world image editing tasks require multiple sequential edits to
achieve desired results. Current editing approaches, primarily designed for
single-object modifications, struggle with sequential editing, especially with
preserving previous edits while adapting new objects naturally into the
existing content. These limitations significantly hinder complex editing
scenarios where multiple objects need to be modified while preserving their
contextual relationships. We address this fundamental challenge through two key
proposals: enabling rough mask inputs that preserve existing content while
naturally integrating new elements, and supporting consistent editing across
multiple modifications. Our framework achieves this through layer-wise memory,
which stores latent representations and prompt embeddings from previous edits.
We propose Background Consistency Guidance, which leverages memorized latents
to maintain scene coherence, and Multi-Query Disentanglement in
cross-attention, which ensures natural adaptation to existing content. To
evaluate our method, we
present a new benchmark dataset incorporating semantic alignment metrics and
interactive editing scenarios. Through comprehensive experiments, we
demonstrate superior performance in iterative image editing tasks with minimal
user effort, requiring only rough masks while maintaining high-quality results
throughout multiple editing steps.
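
To make the layer-wise memory and Background Consistency Guidance concrete, below is a minimal sketch of how such a store and guidance blend could look. The class and function names, the record fields, and the simple mask-based blending rule are illustrative assumptions based only on the abstract, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

import torch


@dataclass
class EditRecord:
    """One past edit: its latent, its prompt embedding, and its region mask."""
    latent: torch.Tensor        # latent representation of the scene after this edit
    prompt_embed: torch.Tensor  # text-encoder embedding of this edit's prompt
    mask: torch.Tensor          # rough user mask of the edited region (1 = edited)


@dataclass
class LayerWiseMemory:
    """Per-edit records, oldest first; one 'layer' per editing step."""
    layers: list[EditRecord] = field(default_factory=list)

    def push(self, record: EditRecord) -> None:
        self.layers.append(record)

    def background_latent(self) -> torch.Tensor:
        """Latent of the latest composite scene, the background for the next edit."""
        return self.layers[-1].latent


def background_consistency_guidance(
    current_latent: torch.Tensor,   # latent being denoised for the new edit
    memory: LayerWiseMemory,
    edit_mask: torch.Tensor,        # rough mask of the new edit region (broadcastable)
) -> torch.Tensor:
    """Re-anchor the unedited region to the memorized background latent.

    Inside the mask the new edit evolves freely; outside it the latent is
    replaced by the stored background, so earlier edits are preserved.
    """
    background = memory.background_latent()
    return edit_mask * current_latent + (1.0 - edit_mask) * background
```

In this reading, each editing step pushes its result into the memory, and the next step's latents are re-anchored to the memorized background outside the user's rough mask, which is one plausible way earlier edits would survive later ones.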
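
Similarly, one hypothetical reading of Multi-Query Disentanglement in cross-attention is to route each edited region's image-token queries to the keys and values of the prompt that created that region, merging the per-prompt outputs by mask. The function name, signature, and routing rule below are assumptions for illustration; the masks are assumed to partition the image tokens (with the background as its own region), and a single attention head is used for simplicity.

```python
import torch
import torch.nn.functional as F


def multi_query_cross_attention(
    hidden: torch.Tensor,               # (B, N, C) image tokens entering the layer
    prompt_embeds: list[torch.Tensor],  # one (B, T, C) text embedding per edit
    masks: list[torch.Tensor],          # one (B, N, 1) region mask per edit
    to_q: torch.nn.Linear,              # the layer's query projection
    to_k: torch.nn.Linear,              # the layer's key projection
    to_v: torch.nn.Linear,              # the layer's value projection
) -> torch.Tensor:
    """Compute cross-attention separately per edit region, then merge by mask.

    Each region attends only to the prompt that created it, so a new prompt
    cannot overwrite the appearance of previously edited objects.
    """
    q = to_q(hidden)                    # queries from image tokens: (B, N, D)
    out = torch.zeros_like(q)
    scale = q.shape[-1] ** -0.5
    for embed, mask in zip(prompt_embeds, masks):
        k, v = to_k(embed), to_v(embed)                       # this edit's prompt
        attn = F.softmax(q @ k.transpose(-1, -2) * scale, dim=-1)
        out = out + mask * (attn @ v)   # route this prompt's output to its region
    return out
```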