DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models
December 1, 2025
Authors: Patrick Kwon, Chen Chen
cs.AI
Abstract
Current story visualization methods tend to position subjects solely by text and face challenges in maintaining artistic consistency. To address these limitations, we introduce DreamingComics, a layout-aware story visualization framework. We build upon a pretrained video diffusion-transformer (DiT) model, leveraging its spatiotemporal priors to enhance identity and style consistency. For layout-based position control, we propose RegionalRoPE, a region-aware positional encoding scheme that re-indexes embeddings based on the target layout. Additionally, we introduce a masked condition loss to further constrain each subject's visual features to their designated region. To infer layouts from natural language scripts, we integrate an LLM-based layout generator trained to produce comic-style layouts, enabling flexible and controllable layout conditioning. We present a comprehensive evaluation of our approach, showing a 29.2% increase in character consistency and a 36.2% increase in style similarity compared to previous methods, while displaying high spatial accuracy. Our project page is available at https://yj7082126.github.io/dreamingcomics/
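The two core mechanisms named above — RegionalRoPE's layout-driven re-indexing of positional embeddings, and the masked condition loss that confines each subject's supervision to its region — can be illustrated with a minimal sketch. Everything below is a hypothetical toy implementation of the stated ideas, not the paper's actual code: the function names, the box format `(top, left, bottom, right)`, and the per-subject MSE weighting are all assumptions for illustration.

```python
import numpy as np

def regional_reindex(h, w, box):
    """Sketch of the RegionalRoPE idea: re-index a subject's h x w
    reference-token grid so its RoPE (row, col) indices span the target
    layout box (top, left, bottom, right) instead of the original grid.
    """
    top, left, bottom, right = box
    rows = np.linspace(top, bottom - 1, h).round().astype(int)
    cols = np.linspace(left, right - 1, w).round().astype(int)
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.stack([rr, cc], axis=-1)  # (h, w, 2) re-indexed positions

def masked_condition_loss(pred, target, masks):
    """Sketch of a masked condition loss: per-subject squared error is
    weighted by that subject's binary layout mask, so each subject's
    visual features are supervised only inside its designated region."""
    total = 0.0
    for m in masks:  # m: binary (H, W) region mask for one subject
        total += float((m * (pred - target) ** 2).sum() / max(m.sum(), 1))
    return total
```

Under this sketch, attention between a re-indexed reference token and target tokens inside the box behaves as if the subject already sat at that location, which is one plausible reading of how layout conditioning via positional encoding works.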