Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation
August 19, 2024
Authors: Yunxin Li, Haoyuan Shi, Baotian Hu, Longyue Wang, Jiashun Zhu, Jinyi Xu, Zhen Zhao, Min Zhang
cs.AI
Abstract
Traditional animation generation methods depend on training generative models
with human-labelled data, entailing a sophisticated multi-stage pipeline that
demands substantial human effort and incurs high training costs. Due to limited
prompting plans, these methods typically produce brief, information-poor, and
context-incoherent animations. To overcome these limitations and automate the
animation process, we pioneer the introduction of large multimodal models
(LMMs) as the core processor to build an autonomous animation-making agent,
named Anim-Director. This agent mainly harnesses the advanced understanding and
reasoning capabilities of LMMs and generative AI tools to create animated
videos from concise narratives or simple instructions. Specifically, it
operates in three main stages. First, Anim-Director generates a coherent
storyline from user inputs, followed by a detailed director's script that
encompasses character profiles, interior/exterior setting descriptions,
and context-coherent scene descriptions covering the appearing characters,
interiors or exteriors, and scene events. Second, we employ LMMs with an
image generation tool to produce visual images of the settings and scenes.
To maintain visual consistency across scenes, the images are generated with
a visual-language prompting method that combines each scene description with
images of the characters and settings that appear in it. Third, the scene
images serve as the foundation for producing animated videos, with LMMs
generating prompts to guide this process. The whole pipeline is notably
autonomous, requiring no manual intervention: the LMMs interact seamlessly
with the generative tools to write prompts, evaluate visual quality, and
select the best outputs to optimize the final result.
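
The abstract describes the agent loop but does not publish an API or implementation. The sketch below is a minimal, hypothetical rendering of the three stages in Python; `lmm`, `text_to_image`, and `image_to_video` are stand-in callables for the multimodal model and the generative tools the paper mentions, and every prompt string and function name here is an assumption, not the paper's code.

```python
"""Hypothetical sketch of the Anim-Director pipeline described in the
abstract. None of these names come from the paper; the LMM and the
image/video generators are injected as callables."""

from typing import Any, Callable, Dict, List

Image = Any  # stand-in type for whatever object the image tool returns


def anim_director(
    user_input: str,
    lmm: Callable[..., str],                  # text (+ optional images) -> text
    text_to_image: Callable[[str], Image],    # prompt -> image
    image_to_video: Callable[[Image, str], Any],  # seed image + prompt -> video
    num_candidates: int = 3,
) -> List[Any]:
    # Stage 1: coherent storyline, then a director's script with character
    # profiles, interior/exterior settings, and per-scene descriptions.
    storyline = lmm(f"Expand into a coherent storyline:\n{user_input}")
    script = lmm(
        "Write a director's script with character profiles, interior/exterior "
        f"settings, and context-coherent scene descriptions:\n{storyline}"
    )

    # Stage 2a: reference images for each character/setting, reused later to
    # keep the look consistent across scenes.
    entities = lmm(f"List each character and setting, one per line:\n{script}").splitlines()
    refs: Dict[str, Image] = {
        name: text_to_image(lmm(f"Write an image prompt depicting {name}:\n{script}"))
        for name in entities
    }

    videos: List[Any] = []
    scenes = lmm(f"List the scene descriptions, one per line:\n{script}").splitlines()
    for scene in scenes:
        # Stage 2b: visual-language prompting -- the LMM writes the image
        # prompt conditioned on the scene text plus the reference images of
        # whatever appears in this scene.
        used = [img for name, img in refs.items() if name in scene]
        candidates = [
            text_to_image(lmm(f"Write an image prompt for this scene:\n{scene}",
                              images=used))
            for _ in range(num_candidates)
        ]

        # The LMM closes the loop itself: it evaluates visual quality and
        # picks the best candidate, so no manual intervention is needed.
        # (A real system would parse/validate this reply rather than int().)
        choice = lmm("Reply with only the index of the best-matching image.",
                     images=candidates)
        scene_image = candidates[int(choice)]

        # Stage 3: the chosen scene image seeds video generation, again
        # guided by an LMM-written motion prompt.
        motion = lmm(f"Write an image-to-video motion prompt for:\n{scene}",
                     images=[scene_image])
        videos.append(image_to_video(scene_image, motion))

    return videos
```

The abstract also says the LMMs regenerate and re-select prompts to optimize the final output; this sketch compresses that into a single generate-candidates-and-pick-best step per scene, which is one plausible reading rather than the paper's exact procedure.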