Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation
August 19, 2024
Authors: Yunxin Li, Haoyuan Shi, Baotian Hu, Longyue Wang, Jiashun Zhu, Jinyi Xu, Zhen Zhao, Min Zhang
cs.AI
Abstract
Traditional animation generation methods depend on training generative models
with human-labelled data, entailing a sophisticated multi-stage pipeline that
demands substantial human effort and incurs high training costs. Due to limited
prompting plans, these methods typically produce brief, information-poor, and
context-incoherent animations. To overcome these limitations and automate the
animation process, we pioneer the introduction of large multimodal models
(LMMs) as the core processor to build an autonomous animation-making agent,
named Anim-Director. This agent mainly harnesses the advanced understanding and
reasoning capabilities of LMMs and generative AI tools to create animated
videos from concise narratives or simple instructions. Specifically, it
operates in three main stages: Firstly, the Anim-Director generates a coherent
storyline from user inputs, followed by a detailed director's script that
encompasses character profiles, interior/exterior setting descriptions, and
context-coherent scene descriptions covering the appearing characters,
interiors or exteriors, and scene events. Secondly, we employ LMMs with the
image generation tool to produce visual images of settings and scenes. These
images are designed to maintain visual consistency across different scenes
using a visual-language prompting method that combines scene descriptions and
images of the appearing characters and settings. Thirdly, scene images serve as
the foundation for producing animated videos, with LMMs generating prompts to
guide this process. The whole process is notably autonomous, requiring no
manual intervention: the LMMs interact seamlessly with generative tools to
generate prompts, evaluate visual quality, and select the best candidates to
optimize the final output.
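The three-stage pipeline described above can be sketched in code. This is a minimal, hedged illustration of the control flow only: the function names (`call_lmm`, `generate_image`, `generate_video`) and the naive scene splitting are hypothetical placeholders, not the paper's actual implementation or any real API.

```python
# Hedged sketch of the three-stage Anim-Director pipeline from the abstract.
# All helper functions below are hypothetical stand-ins for real LMM and
# generative-AI tool calls.

def call_lmm(prompt: str) -> str:
    """Placeholder for querying a large multimodal model (LMM)."""
    return f"LMM output for: {prompt}"

def generate_image(prompt: str) -> str:
    """Placeholder for an image-generation tool; returns an image handle."""
    return f"image({prompt})"

def generate_video(image: str, prompt: str) -> str:
    """Placeholder for an image-to-video generation tool."""
    return f"video({image}, {prompt})"

def anim_director(narrative: str) -> list[str]:
    # Stage 1: expand the user narrative into a coherent storyline, then a
    # detailed director's script (character profiles, settings, scenes).
    storyline = call_lmm(f"Expand into a coherent storyline: {narrative}")
    script = call_lmm("Write a director's script with character profiles, "
                      f"settings, and scene descriptions: {storyline}")
    scenes = [s for s in script.split(". ") if s]  # naive scene split

    videos = []
    for scene in scenes:
        # Stage 2: visual-language prompting -- the image prompt combines the
        # scene description with character/setting imagery for consistency.
        image_prompt = call_lmm(f"Image prompt for scene: {scene}")
        scene_image = generate_image(image_prompt)

        # Stage 3: the scene image anchors video generation; the LMM writes
        # the motion prompt and, in the full agent, also evaluates quality
        # and selects the best candidate.
        video_prompt = call_lmm(f"Motion prompt for scene: {scene}")
        videos.append(generate_video(scene_image, video_prompt))
    return videos
```

In the actual agent, each placeholder would be a round trip to an LMM or a generative tool, with the LMM additionally scoring outputs and retrying prompts; the loop structure above only mirrors the stage ordering given in the abstract.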