MagicScroll: Nontypical Aspect-Ratio Image Generation for Visual Storytelling via Multi-Layered Semantic-Aware Denoising

Abstract

Visual storytelling often uses images with nontypical aspect ratios, such as scroll paintings, comic strips, and panoramas, to create expressive and compelling narratives. Although generative AI has achieved great success and shown the potential to reshape the creative industry, it remains challenging to generate coherent and engaging content of arbitrary size with controllable style, concept, and layout, all of which are essential for visual storytelling. To overcome the shortcomings of previous methods, including repetitive content, style inconsistency, and lack of controllability, we propose MagicScroll, a multi-layered, progressive diffusion-based image generation framework with a novel semantic-aware denoising process. The model enables fine-grained control over the generated image at the object, scene, and background levels, using text, image, and layout conditions. We also establish the first benchmark for nontypical aspect-ratio image generation for visual storytelling, covering mediums such as paintings, comics, and cinematic panoramas, with metrics customized for systematic evaluation. In comparative and ablation studies, MagicScroll shows promising results in aligning with the narrative text, improving visual coherence, and engaging the audience. We plan to release the code and benchmark in the hope of fostering better collaboration between AI researchers and creative practitioners working in visual storytelling.
