Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation
July 13, 2023
Authors: Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, Qifeng Chen
cs.AI
Abstract
Generating videos for visual storytelling can be a tedious and complex
process that typically requires either live-action filming or graphics
animation rendering. To bypass these challenges, our key idea is to utilize the
abundance of existing video clips and synthesize a coherent storytelling video
by customizing their appearances. We achieve this by developing a framework
comprising two functional modules: (i) Motion Structure Retrieval, which
provides video candidates with desired scene or motion context described by
query texts, and (ii) Structure-Guided Text-to-Video Synthesis, which generates
plot-aligned videos under the guidance of motion structure and text prompts.
For the first module, we leverage an off-the-shelf video retrieval system and
extract video depths as motion structure. For the second module, we propose a
controllable video generation model that offers flexible controls over
structure and characters. The videos are synthesized by following the
structural guidance and appearance instructions. To ensure visual consistency
across clips, we propose an effective concept personalization approach, which
allows the specification of the desired character identities through text
prompts. Extensive experiments demonstrate that our approach exhibits
significant advantages over various existing baselines.
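The two-module design described above can be sketched as a simple per-shot pipeline: retrieve a candidate clip for each plot sentence, extract its depth as the motion structure, then synthesize a new clip conditioned on that structure, the text prompt, and a personalized character token. All function names below are hypothetical stand-ins, not the authors' actual API; the stubs only illustrate the data flow.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    text: str          # plot sentence realized by this shot
    depth: list        # per-frame depth maps acting as motion structure


def retrieve_candidates(query: str, top_k: int = 3) -> list:
    """Module (i): stand-in for an off-the-shelf text-to-video retrieval system."""
    # A real system would rank clips in a video database by text similarity;
    # here we just fabricate candidate identifiers.
    return [f"clip_{abs(hash(query)) % 1000}_{i}" for i in range(top_k)]


def extract_depth(clip_id: str) -> list:
    """Extract per-frame depth from the retrieved clip (placeholder values)."""
    return [0.0, 0.0, 0.0]  # one scalar per frame stands in for a depth map


def synthesize(structure: list, prompt: str, character_token: str) -> Clip:
    """Module (ii): structure-guided text-to-video synthesis (stub).

    character_token represents a personalized concept (e.g. a learned text
    token) that keeps the protagonist's identity consistent across clips.
    """
    return Clip(text=f"{prompt}, featuring {character_token}", depth=structure)


def animate_a_story(plot_sentences: list, character_token: str) -> list:
    """End-to-end sketch: retrieve -> extract depth -> synthesize, per shot."""
    story = []
    for sentence in plot_sentences:
        best = retrieve_candidates(sentence)[0]   # take the top candidate
        structure = extract_depth(best)
        story.append(synthesize(structure, sentence, character_token))
    return story
```

Because every shot is conditioned on the same character token, the appearance stays consistent even though each shot borrows motion structure from a different retrieved clip.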