VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

Abstract

Modern text-to-video synthesis models demonstrate coherent, photorealistic generation of complex videos from a text description. However, most existing models lack fine-grained control over camera movement, which is critical for downstream applications related to content creation, visual effects, and 3D vision. Recently, new methods have demonstrated the ability to generate videos with controllable camera poses; these techniques leverage pre-trained U-Net-based diffusion models that explicitly disentangle spatial and temporal generation. Still, no existing approach enables camera control for new, transformer-based video diffusion models that process spatial and temporal information jointly. Here, we propose to tame video transformers for 3D camera control using a ControlNet-like conditioning mechanism that incorporates spatiotemporal camera embeddings based on Plücker coordinates. The approach demonstrates state-of-the-art performance for controllable video generation after fine-tuning on the RealEstate10K dataset. To the best of our knowledge, our work is the first to enable camera control for transformer-based video diffusion models.
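
The abstract does not spell out how the Plücker-based camera embedding is built, so the following is only a minimal sketch of one common construction, not the authors' implementation. It assumes a pinhole camera with intrinsics `K`, a 4x4 camera-to-world pose `c2w`, rays through pixel centers, and the ordering (o × d, d) for the six coordinates. The function name `plucker_embedding` and all parameter names are hypothetical.

```python
import numpy as np

def plucker_embedding(K, c2w, height, width):
    """Per-pixel Plücker ray coordinates (o x d, d) for one camera.

    Assumptions (not taken from the source): pinhole intrinsics K (3x3),
    camera-to-world pose c2w (4x4), rays cast through pixel centers.
    Returns an array of shape (height, width, 6).
    """
    # Pixel-center coordinates; with the default 'xy' indexing both are (H, W).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # homogeneous, (H, W, 3)

    # Back-project pixels to ray directions in the camera frame.
    dirs_cam = pixels @ np.linalg.inv(K).T

    # Rotate directions into the world frame and normalize to unit length.
    rotation, origin = c2w[:3, :3], c2w[:3, 3]
    dirs_world = dirs_cam @ rotation.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # The moment o x d encodes where the ray sits in space, not just its direction.
    origins = np.broadcast_to(origin, dirs_world.shape)
    moment = np.cross(origins, dirs_world)

    return np.concatenate([moment, dirs_world], axis=-1)
```

Under these assumptions, calling the function once per frame and stacking the results gives a (frames, height, width, 6) array, which is one plausible form for a spatiotemporal camera signal that a ControlNet-like branch could consume.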
