Fine-grained Controllable Video Generation via Object Appearance and Context

Abstract

Text-to-video generation has shown promising results. However, because these models take only natural language as input, users often have difficulty providing the detailed information needed to precisely control the output. In this work, we propose fine-grained controllable video generation (FACTOR) to achieve detailed control. Specifically, FACTOR aims to control objects' appearances and context, including their location and category, in conjunction with the text prompt. To achieve this, we propose a unified framework that jointly injects control signals into an existing text-to-video model. Our model consists of a joint encoder and adaptive cross-attention layers. By optimizing the encoder and the inserted layers, we adapt the model to generate videos that are aligned with both the text prompt and the fine-grained control. Compared to existing methods that rely on dense control signals such as edge maps, we provide a more intuitive and user-friendly interface for object-level fine-grained control. Our method achieves controllability of object appearances without fine-tuning, which reduces the per-subject optimization effort required of users. Extensive experiments on standard benchmark datasets and user-provided inputs validate that our model obtains a 70% improvement in controllability metrics over competitive baselines.
