VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning

Abstract

Although recent text-to-video (T2V) generation methods have seen significant advancements, most of these works focus on producing short video clips of a single event with a single background (i.e., single-scene videos). Meanwhile, recent large language models (LLMs) have demonstrated their capability in generating layouts and programs to control downstream visual modules such as image generation models. This raises an important question: can we leverage the knowledge embedded in these LLMs for temporally consistent long video generation? In this paper, we propose VideoDirectorGPT, a novel framework for consistent multi-scene video generation that uses the knowledge of LLMs for video content planning and grounded video generation. Specifically, given a single text prompt, we first ask our video planner LLM (GPT-4) to expand it into a 'video plan', which involves generating the scene descriptions, the entities with their respective layouts, the background for each scene, and consistency groupings of the entities and backgrounds. Next, guided by this output from the video planner, our video generator, Layout2Vid, has explicit control over spatial layouts and can maintain temporal consistency of entities/backgrounds across scenes, while being trained only with image-level annotations. Our experiments demonstrate that the VideoDirectorGPT framework substantially improves layout and movement control in both single- and multi-scene video generation and can generate multi-scene videos with visual consistency across scenes, while achieving performance competitive with state-of-the-art (SOTA) methods in open-domain single-scene T2V generation. We also demonstrate that our framework can dynamically control the strength of layout guidance and can also generate videos with user-provided images. We hope our framework can inspire future work on better integrating the planning ability of LLMs into consistent long video generation.
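To make the components of a 'video plan' concrete, the following is a minimal illustrative sketch of how such a plan might be represented as a data structure. It is not the authors' format: the class names (Scene, VideoPlan), the normalized box convention, and the consistency_groups helper are all assumptions introduced here for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical box format (an assumption): (x0, y0, x1, y1), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class Scene:
    description: str  # text description of the scene
    background: str   # background name for the scene
    # entity name -> one bounding box per keyframe (the entity's layout)
    layouts: Dict[str, List[Box]] = field(default_factory=dict)


@dataclass
class VideoPlan:
    scenes: List[Scene]

    def consistency_groups(self) -> Dict[str, List[int]]:
        """Map each background/entity name to the scene indices where it appears."""
        groups: Dict[str, List[int]] = {}
        for i, scene in enumerate(self.scenes):
            for name in [scene.background, *scene.layouts]:
                groups.setdefault(name, []).append(i)
        return groups


# A toy two-scene plan; the content is invented purely for illustration.
plan = VideoPlan(scenes=[
    Scene("a cat naps on a sofa", "living room",
          {"cat": [(0.2, 0.5, 0.6, 0.9), (0.2, 0.5, 0.6, 0.9)]}),
    Scene("the cat walks to its bowl", "kitchen",
          {"cat": [(0.1, 0.5, 0.4, 0.9), (0.4, 0.5, 0.7, 0.9)],
           "bowl": [(0.7, 0.7, 0.9, 0.9), (0.7, 0.7, 0.9, 0.9)]}),
])
groups = plan.consistency_groups()
print(groups["cat"])  # the cat appears in both scenes: [0, 1]
```

In this sketch, names that occur in more than one scene (here, "cat") are the ones a generator would need to render with a consistent appearance across scenes.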
