EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

Abstract

Recent approaches to 3D camera control in video diffusion models (VDMs) often create anchor videos that guide the diffusion model as a structured prior, rendering them from estimated point clouds along annotated camera trajectories. However, errors inherent in point cloud estimation often lead to inaccurate anchor videos, and the need for extensive camera trajectory annotations further increases resource demands. To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework that automatically constructs high-quality anchor videos without expensive camera trajectory annotations. Concretely, we create highly precise anchor videos for training by masking source videos based on first-frame visibility. This approach ensures high alignment and requires no camera trajectory annotations, so it can be readily applied to any in-the-wild video to generate image-to-video (I2V) training pairs. Furthermore, we introduce Anchor-ControlNet, a lightweight conditioning module that integrates anchor video guidance in visible regions into pretrained VDMs while using less than 1% of the backbone model's parameters. By combining the proposed anchor video data with the ControlNet module, EPiC trains efficiently with substantially fewer parameters, fewer training steps, and less data, and it does not require the modifications to the diffusion model backbone that are typically needed to mitigate rendering misalignments. Although trained on masking-based anchor videos, our method generalizes robustly at inference time to anchor videos made with point clouds, enabling precise 3D-informed camera control. EPiC achieves state-of-the-art (SOTA) performance on RealEstate10K and MiraData for the I2V camera control task, demonstrating precise and robust camera control both quantitatively and qualitatively. Notably, EPiC also exhibits strong zero-shot generalization to video-to-video scenarios.
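As a rough illustration of the masking step described in the abstract, the sketch below builds an anchor video by blanking every pixel that is not visible in the first frame. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `build_anchor_video`, the boolean `visible` input, and the fill value of 0 are all hypothetical, and the abstract does not say how per-frame visibility is estimated, so the mask is taken as given.

```python
import numpy as np


def build_anchor_video(video: np.ndarray, visible: np.ndarray, fill: int = 0) -> np.ndarray:
    """Keep only the pixels whose content is also visible in the first frame.

    video:   (T, H, W, C) array of source frames.
    visible: (T, H, W) boolean array, True where the pixel's content is
             visible in frame 0 (hypothetical input, computed elsewhere).
    fill:    value written into masked-out pixels (an assumption).
    """
    if video.shape[:3] != visible.shape:
        raise ValueError("video and visible must agree on (T, H, W)")

    mask = visible.astype(bool).copy()
    mask[0] = True  # the first frame is trivially visible to itself

    anchor = video.copy()  # leave the source video untouched
    anchor[~mask] = fill   # blank every channel of the non-visible pixels
    return anchor


# Tiny usage example: 2 frames of 2x2 RGB, one visible pixel in frame 1.
video = (np.arange(2 * 2 * 2 * 3).reshape(2, 2, 2, 3) + 1).astype(np.uint8)
visible = np.zeros((2, 2, 2), dtype=bool)
visible[1, 0, 0] = True
anchor = build_anchor_video(video, visible)
```

Because the anchor is cut directly from the source video, the visible regions match the target frames exactly, which is the alignment property the abstract attributes to masking-based anchors.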
