MagicScroll: Nontypical Aspect-Ratio Image Generation for Visual Storytelling via Multi-Layered Semantic-Aware Denoising

Abstract

Visual storytelling often uses images with nontypical aspect ratios, such as scroll paintings, comic strips, and panoramas, to create expressive and compelling narratives. Although generative AI has achieved great success and shown the potential to reshape the creative industry, it remains a challenge to generate coherent and engaging content of arbitrary size with controllable style, concept, and layout, all of which are essential for visual storytelling. To overcome the shortcomings of previous methods, including repetitive content, style inconsistency, and lack of controllability, we propose MagicScroll, a multi-layered, progressive diffusion-based image generation framework with a novel semantic-aware denoising process. The model enables fine-grained control over the generated image at the object, scene, and background levels through text, image, and layout conditions. We also establish the first benchmark for nontypical aspect-ratio image generation for visual storytelling, covering media such as paintings, comics, and cinematic panoramas, with metrics customized for systematic evaluation. Through comparative and ablation studies, MagicScroll shows promising results in aligning with the narrative text, improving visual coherence, and engaging the audience. We plan to release the code and benchmark in the hope of fostering better collaboration between AI researchers and creative practitioners involved in visual storytelling.

