Improving Editability in Image Generation with Layer-wise Memory
May 2, 2025
Authors: Daneul Kim, Jaeah Lee, Jaesik Park
cs.AI
Abstract
Most real-world image editing tasks require multiple sequential edits to
achieve desired results. Current editing approaches, primarily designed for
single-object modifications, struggle with sequential editing, especially with
preserving previous edits while naturally adapting new objects into the
existing content. These limitations significantly hinder complex editing
scenarios where multiple objects need to be modified while preserving their
contextual relationships. We address this fundamental challenge through two key
proposals: enabling rough mask inputs that preserve existing content while
naturally integrating new elements, and supporting consistent editing across
multiple modifications. Our framework achieves this through layer-wise memory,
which stores latent representations and prompt embeddings from previous edits.
We propose Background Consistency Guidance, which leverages memorized latents
to maintain scene coherence, and Multi-Query Disentanglement in
cross-attention, which ensures natural adaptation to existing content. To
evaluate our method, we
present a new benchmark dataset incorporating semantic alignment metrics and
interactive editing scenarios. Through comprehensive experiments, we
demonstrate superior performance in iterative image editing tasks with minimal
user effort, requiring only rough masks while maintaining high-quality results
throughout multiple editing steps.
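The abstract only names the mechanisms, so the following is a minimal sketch of what a layer-wise memory with a background-consistency blend could look like. The class and method names (EditLayer, LayerwiseMemory, blend_background), the tensor shapes, and the mask-based blending rule are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's code) of a layer-wise memory
# that stores latents and prompt embeddings per edit, plus a mask-based blend
# standing in for Background Consistency Guidance.
from dataclasses import dataclass, field
from typing import List

import torch


@dataclass
class EditLayer:
    latent: torch.Tensor        # scene latent after this edit, e.g. (4, 64, 64)
    prompt_embed: torch.Tensor  # text embedding used for this edit
    mask: torch.Tensor          # rough binary mask of the edited region, (1, 64, 64)


@dataclass
class LayerwiseMemory:
    layers: List[EditLayer] = field(default_factory=list)

    def push(self, latent: torch.Tensor, prompt_embed: torch.Tensor,
             mask: torch.Tensor) -> None:
        """Record the latent, prompt embedding, and mask of one edit step."""
        self.layers.append(EditLayer(latent, prompt_embed, mask))

    def blend_background(self, current_latent: torch.Tensor,
                         current_mask: torch.Tensor) -> torch.Tensor:
        """Hypothetical background-consistency step: keep newly generated
        content inside the rough mask, and restore the memorized latent
        outside it so earlier edits and the background survive unchanged."""
        if not self.layers:
            return current_latent
        memorized = self.layers[-1].latent
        return current_mask * current_latent + (1.0 - current_mask) * memorized
```

In a sequential-editing loop, each step would call blend_background on the model's output before push-ing the result, so the memory always holds a latent in which every earlier edit is intact; the per-layer prompt_embed is what a disentangled cross-attention could then route region by region.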
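Likewise, a rough sketch of what "Multi-Query Disentanglement in cross-attention" might mean operationally: each stored prompt embedding acts as a separate text stream whose cross-attention output is confined to its own edit mask, so prompts do not leak across regions. The function name and shapes are again assumptions for illustration.

```python
# Hypothetical sketch of region-confined cross-attention: each edit's prompt
# tokens only influence the spatial positions inside that edit's mask.
from typing import List

import torch


def disentangled_cross_attention(image_feats: torch.Tensor,
                                 prompt_embeds: List[torch.Tensor],
                                 masks: List[torch.Tensor]) -> torch.Tensor:
    """image_feats: (N, d) flattened spatial features; prompt_embeds[i]:
    (T_i, d) text tokens for edit i; masks[i]: (N,) binary region mask."""
    out = torch.zeros_like(image_feats)
    d = image_feats.shape[-1]
    for embed, mask in zip(prompt_embeds, masks):
        # standard cross-attention: image queries attend to this prompt's tokens
        attn = torch.softmax(image_feats @ embed.T / d ** 0.5, dim=-1)  # (N, T_i)
        region = attn @ embed                                           # (N, d)
        # confine this prompt's contribution to its own edit region
        out = out + mask.unsqueeze(-1) * region
    # positions outside every mask could fall back to a base/background
    # prompt in a full model; omitted here for brevity
    return out
```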