Audit and Repair: An Agentic Framework for Consistent Story Visualization in Text-to-Image Diffusion Models

Abstract

Story visualization has become a popular task in which visual scenes are generated to depict a narrative across multiple panels. A central challenge in this setting is maintaining visual consistency, particularly in how characters and objects persist and evolve throughout the story. Despite recent advances in diffusion models, current approaches often fail to preserve key character attributes, leading to incoherent narratives. In this work, we propose a collaborative multi-agent framework that autonomously identifies, corrects, and refines inconsistencies across multi-panel story visualizations. The agents operate in an iterative loop, enabling fine-grained, panel-level updates without regenerating entire sequences. Our framework is model-agnostic and flexibly integrates with a variety of diffusion models, including rectified flow transformers such as Flux and latent diffusion models such as Stable Diffusion. Quantitative and qualitative experiments show that our method outperforms prior approaches in terms of multi-panel consistency.
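The iterative, panel-level loop described above can be illustrated with a toy sketch. This is not the paper's implementation: the names `Panel`, `audit`, `repair`, and `audit_and_repair` are hypothetical, and each panel is reduced to a dictionary of character attributes standing in for what a vision-language auditor might read off a generated image. In a real system, `repair` would presumably call an image editing or regeneration model rather than overwrite values. The sketch only shows the control flow: audit every panel, repair the flagged ones, and leave consistent panels untouched.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Attributes = Dict[str, str]


@dataclass
class Panel:
    prompt: str
    # Assumption: stand-in for attributes an auditor would extract from the image.
    attributes: Attributes


def audit(panels: List[Panel], reference: Attributes) -> Dict[int, List[str]]:
    """Map each inconsistent panel index to the attribute names that differ."""
    issues: Dict[int, List[str]] = {}
    for i, panel in enumerate(panels):
        wrong = [k for k, v in reference.items() if panel.attributes.get(k) != v]
        if wrong:
            issues[i] = wrong
    return issues


def repair(panel: Panel, wrong: List[str], reference: Attributes) -> Panel:
    """Toy repair: overwrite flagged attributes (a real system would edit the image)."""
    fixed = dict(panel.attributes)
    for key in wrong:
        fixed[key] = reference[key]
    return Panel(panel.prompt, fixed)


def audit_and_repair(
    panels: List[Panel], reference: Attributes, max_rounds: int = 3
) -> Tuple[List[Panel], List[int]]:
    """Loop audit -> repair on flagged panels only; return panels and touched indices."""
    panels = list(panels)
    touched = set()
    for _ in range(max_rounds):
        issues = audit(panels, reference)
        if not issues:
            break
        for i, wrong in issues.items():
            panels[i] = repair(panels[i], wrong, reference)
            touched.add(i)
    return panels, sorted(touched)


# Hypothetical three-panel story with two inconsistent panels.
reference = {"hair": "red", "coat": "blue"}
story = [
    Panel("Mia walks into the forest", {"hair": "red", "coat": "blue"}),
    Panel("Mia finds a lantern", {"hair": "brown", "coat": "blue"}),
    Panel("Mia returns home", {"hair": "red", "coat": "green"}),
]
fixed, touched = audit_and_repair(story, reference)
print(touched)
```

Only the second and third panels are rewritten here; the first panel object is passed through unchanged, which mirrors the idea of updating individual panels instead of regenerating the whole sequence.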