ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing

Abstract

In this paper, we present ControlVideo, a novel method for text-driven video editing. Leveraging the capabilities of text-to-image diffusion models and ControlNet, ControlVideo aims to enhance the fidelity and temporal consistency of videos that align with a given text while preserving the structure of the source video. This is achieved by incorporating additional conditions such as edge maps and by fine-tuning the key-frame and temporal attention on the source video-text pair with carefully designed strategies. An in-depth exploration of ControlVideo's design is conducted to inform future research on one-shot tuning of video diffusion models. Quantitatively, ControlVideo outperforms a range of competitive baselines in terms of faithfulness and consistency while still aligning with the textual prompt. Additionally, it delivers videos with high visual realism and fidelity with respect to the source content, demonstrating flexibility in utilizing controls that contain varying degrees of source video information, and the potential for combining multiple controls. The project page is available at https://ml.cs.tsinghua.edu.cn/controlvideo/.
