AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort

Abstract

Story visualization aims to generate a series of images that match a story described in text. The generated images must be of high quality, align with the text description, and keep character identities consistent. Given the complexity of the task, existing methods drastically simplify the problem, either by considering only a few specific characters and scenarios or by requiring users to provide per-image control conditions such as sketches. These simplifications make the methods unsuitable for real applications. To address this, we propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images with minimal human interaction. Specifically, we use the comprehension and planning capabilities of large language models for layout planning, and then leverage large-scale text-to-image models to generate sophisticated story images based on the layout. We empirically find that sparse control conditions, such as bounding boxes, are suitable for layout planning, while dense control conditions, such as sketches and keypoints, are suitable for generating high-quality image content. To obtain the best of both worlds, we devise a dense condition generation module that transforms simple bounding box layouts into sketch or keypoint control conditions for final image generation, which not only improves image quality but also allows easy and intuitive user interaction. In addition, we propose a simple yet effective method to generate multi-view consistent character images, eliminating the reliance on human labor to collect or draw character images.
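
The contrast between sparse and dense control conditions can be made concrete with a small sketch. The code below is illustrative only and is not the authors' implementation: the JSON schema, the `Box` type, and the `parse_layout` and `rasterize` helpers are all hypothetical names introduced here. It assumes a language model has returned a layout as a JSON list of prompts with bounding boxes, clamps those boxes to the canvas, and then expands them into a per-pixel label map. That label map is only a stand-in for a dense condition; the abstract's module produces sketches or keypoints, which this sketch does not attempt.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """One subject in a sparse layout (hypothetical schema, not from the paper)."""
    prompt: str
    x0: int
    y0: int
    x1: int
    y1: int


def parse_layout(llm_reply: str, width: int = 512, height: int = 512) -> list[Box]:
    """Parse a JSON layout and clamp every box to the canvas."""
    boxes = []
    for item in json.loads(llm_reply):
        x0, y0, x1, y1 = item["box"]
        # Order the corners, then clamp them to the canvas bounds.
        x0, x1 = sorted((x0, x1))
        y0, y1 = sorted((y0, y1))
        x0, x1 = max(0, x0), min(width, x1)
        y0, y1 = max(0, y0), min(height, y1)
        if x1 > x0 and y1 > y0:  # drop boxes that are empty after clamping
            boxes.append(Box(item["prompt"], x0, y0, x1, y1))
    return boxes


def rasterize(boxes: list[Box], width: int = 512, height: int = 512) -> list[list[int]]:
    """Expand boxes into a per-pixel label map (0 = background, i + 1 = box i)."""
    grid = [[0] * width for _ in range(height)]
    for label, b in enumerate(boxes, start=1):
        for y in range(b.y0, b.y1):
            row = grid[y]
            for x in range(b.x0, b.x1):
                row[x] = label  # later boxes overwrite earlier ones on overlap
    return grid
```

A sparse layout is a handful of numbers per subject, which is easy for a language model to emit and for a user to edit, whereas the dense map carries one value per pixel. That size difference is one way to read the abstract's claim that boxes suit planning while dense conditions suit image generation.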
