SEED-Story: Multimodal Long Story Generation with Large Language Model

July 11, 2024
Authors: Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, Yingcong Chen
cs.AI

Abstract

With the remarkable advancements in image generation and open-form text generation, the creation of interleaved image-text content has become an increasingly intriguing field. Multimodal story generation, characterized by producing narrative texts and vivid images in an interleaved manner, has emerged as a valuable and practical task with broad applications. However, this task poses significant challenges, as it requires comprehending the complex interplay between texts and images and generating long sequences of coherent, contextually relevant texts and visuals. In this work, we propose SEED-Story, a novel method that leverages a Multimodal Large Language Model (MLLM) to generate extended multimodal stories. Our model, built upon the powerful comprehension capability of the MLLM, predicts text tokens as well as visual tokens, which are subsequently processed with an adapted visual de-tokenizer to produce images with consistent characters and styles. We further propose a multimodal attention sink mechanism that enables highly efficient autoregressive generation of stories with up to 25 sequences, even though training uses only 10. Additionally, we present a large-scale, high-resolution dataset named StoryStream for training our model and quantitatively evaluating the task of multimodal story generation in various aspects.
