

Break-A-Scene: Extracting Multiple Concepts from a Single Image

May 25, 2023
作者: Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, Dani Lischinski
cs.AI

Abstract

Text-to-image model personalization aims to introduce a user-provided concept to the model, allowing its synthesis in diverse contexts. However, current methods primarily focus on the case of learning a single concept from multiple images with variations in backgrounds and poses, and struggle when adapted to a different scenario. In this work, we introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed at improving the ability to combine multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method. Project page is available at: https://omriavrahami.com/break-a-scene/
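The masked diffusion loss mentioned in the abstract can be pictured as an MSE noise-prediction loss restricted to the pixels covered by a concept's mask. The following is a minimal NumPy sketch of that idea, not the authors' implementation; the function name and signature are assumptions for illustration:

```python
import numpy as np

def masked_diffusion_loss(noise_pred, noise, mask):
    """Hypothetical sketch: squared-error diffusion loss where only
    pixels inside the concept mask contribute to the average."""
    mask = np.broadcast_to(mask, noise.shape)      # broadcast mask over channels
    se = (noise_pred - noise) ** 2                 # per-pixel squared error
    return (se * mask).sum() / max(mask.sum(), 1)  # mean over masked pixels only
```

Restricting the loss to masked pixels lets each handle be supervised on its own concept without penalizing the model for the (unrelated) background of the single training image.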