ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation

Abstract

Multi-layer image generation is a fundamental task that enables users to isolate, select, and edit specific image layers, thereby revolutionizing interactions with generative models. In this paper, we introduce the Anonymous Region Transformer (ART), which directly generates variable multi-layer transparent images from a global text prompt and an anonymous region layout. Inspired by schema theory, which suggests that knowledge is organized in frameworks (schemas) that enable people to interpret and learn from new information by linking it to prior knowledge, this anonymous region layout allows the generative model to autonomously determine which set of visual tokens should align with which text tokens, in contrast to the semantic layouts that previously dominated image generation. In addition, the layer-wise region crop mechanism, which selects only the visual tokens belonging to each anonymous region, significantly reduces attention computation costs and enables the efficient generation of images with numerous distinct layers (e.g., 50+). Compared with the full attention approach, our method is over 12 times faster and exhibits fewer layer conflicts. Furthermore, we propose a high-quality multi-layer transparent image autoencoder that supports the direct encoding and decoding of the transparency of variable multi-layer images in a joint manner. By enabling precise control and scalable layer generation, ART establishes a new paradigm for interactive content creation.
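The cost argument behind the layer-wise region crop can be illustrated with a rough token count. The sketch below is not the paper's implementation: the box format, the canvas size, the example layout, and all function names are assumptions made for illustration. It also assumes, as a simplification, that self-attention cost grows with the square of the number of visual tokens.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in token units, x1/y1 exclusive.
Box = Tuple[int, int, int, int]


def box_tokens(box: Box) -> int:
    """Number of visual tokens inside one region box."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def full_token_count(canvas_w: int, canvas_h: int, num_layers: int) -> int:
    """Tokens when every layer keeps the whole canvas (no cropping)."""
    return canvas_w * canvas_h * num_layers


def cropped_token_count(boxes: List[Box]) -> int:
    """Tokens when each layer keeps only the tokens inside its own region."""
    return sum(box_tokens(b) for b in boxes)


def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs in dense self-attention (quadratic in token count)."""
    return num_tokens * num_tokens


# Illustrative layout: a 32x32 token canvas with three layer regions.
canvas_w, canvas_h = 32, 32
boxes: List[Box] = [(0, 0, 16, 16), (8, 8, 24, 16), (16, 16, 32, 32)]

full = full_token_count(canvas_w, canvas_h, len(boxes))
cropped = cropped_token_count(boxes)

print("tokens without cropping:", full)
print("tokens with region crop:", cropped)
print("attention pair ratio:", attention_pairs(full) / attention_pairs(cropped))
```

Under these assumptions the saving grows as the regions become small relative to the canvas, which is the case a layout with many small layers would fall into. The numbers here come from the made-up layout only and say nothing about the speedup reported in the abstract.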
