CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

Abstract

Recently, video diffusion models have emerged as expressive generative tools for high-quality video content creation that are readily available to general users. However, these models often do not offer precise control over camera poses during video generation, which limits the expression of cinematic language and user control. To address this issue, we introduce CamCo, which allows fine-grained Camera pose Control for image-to-video generation. We equip a pre-trained image-to-video generator with accurately parameterized camera pose input using Plücker coordinates. To enhance 3D consistency in the generated videos, we integrate an epipolar attention module into each attention block that enforces epipolar constraints on the feature maps. Additionally, we fine-tune CamCo on real-world videos with camera poses estimated through structure-from-motion algorithms to better synthesize object motion. Our experiments show that CamCo significantly improves 3D consistency and camera control capabilities compared to previous models while effectively generating plausible object motion. Project page: https://ir1d.github.io/CamCo/
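
As an illustration of the Plücker parameterization mentioned above, the following is a minimal sketch of how a camera pose could be turned into a per-pixel ray map. It is not taken from the CamCo code. The function name `plucker_rays`, the world-to-camera extrinsics convention, and the (direction, moment) channel order are all assumptions made for this example; the paper may use a different layout.

```python
import numpy as np


def plucker_rays(K, R, t, height, width):
    """Return an (H, W, 6) array of per-pixel Pluecker coordinates (d, o x d).

    Assumed convention (hypothetical, for illustration only):
    world-to-camera extrinsics, i.e. x_cam = R @ x_world + t.
    """
    # Camera center expressed in world coordinates.
    origin = -R.T @ t

    # Pixel centers in homogeneous image coordinates, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project pixels to camera-frame directions: rows of (K^-1 p)^T.
    dirs_cam = pixels @ np.linalg.inv(K).T

    # Rotate to the world frame; for row vectors, d^T R equals (R^T d)^T.
    dirs_world = dirs_cam @ R
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Moment vector m = o x d; together (d, m) identifies the ray.
    moment = np.cross(origin, dirs_world)
    return np.concatenate([dirs_world, moment], axis=-1)


if __name__ == "__main__":
    # Example intrinsics and pose; the values are arbitrary.
    K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
    rays = plucker_rays(K, np.eye(3), np.array([0.1, 0.0, 0.5]), 64, 64)
    print(rays.shape)
```

Each pixel receives a six-channel vector that depends on both intrinsics and extrinsics, so a map like this can be concatenated with image features as a dense conditioning signal. By construction, the direction and moment of every ray are orthogonal, which is a quick sanity check on any implementation.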
