SEED-Story: Multimodal Long Story Generation with Large Language Model

Abstract

With the remarkable advancements in image generation and open-form text generation, the creation of interleaved image-text content has become an increasingly intriguing field. Multimodal story generation, which produces narrative text and vivid images in an interleaved manner, has emerged as a valuable and practical task with broad applications. However, the task poses significant challenges: it requires comprehending the complex interplay between text and images, and generating long sequences of coherent, contextually relevant text and visuals. In this work, we propose SEED-Story, a novel method that leverages a Multimodal Large Language Model (MLLM) to generate extended multimodal stories. Our model, built upon the powerful comprehension capability of the MLLM, predicts text tokens as well as visual tokens, which are subsequently processed by an adapted visual de-tokenizer to produce images with consistent characters and styles. We further propose a multimodal attention sink mechanism that enables the generation of stories with up to 25 sequences (with only 10 used for training) in a highly efficient autoregressive manner. Additionally, we present a large-scale, high-resolution dataset named StoryStream for training our model and for quantitatively evaluating multimodal story generation in various aspects.
