Warp-as-History: Generalizable Camera-Controlled Video Generation from a Single Training Video

Abstract

Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existing methods usually learn camera-specific conditioning through camera encoders, control branches, or attention and positional-encoding modifications, which often require post-training on large-scale camera-annotated videos. Training-free alternatives avoid such post-training, but often shift the cost to test-time optimization or extra denoising-time guidance. We propose Warp-as-History, a simple interface that turns camera-induced warps into camera-warped pseudo-history with target-frame positional alignment and visible-token selection. Given a target camera trajectory, we construct camera-warped pseudo-history from past observations and feed it through the model's visual-history pathway. Crucially, we align its positional encoding with the target frames being denoised and remove warped-history tokens without valid source observations. Without any training, architectural modification, or test-time optimization, this interface reveals a non-trivial zero-shot capability of a frozen video generation model to follow camera trajectories. Moreover, lightweight offline LoRA finetuning on only one camera-annotated video further improves this capability and generalizes to unseen videos, improving camera adherence, visual quality, and motion dynamics without test-time optimization or target-video adaptation. Extensive experiments on diverse datasets confirm the effectiveness of our method.
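
The two interface steps named above, visible-token selection and target-frame positional alignment, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the patch-level pooling of the validity mask, the `min_valid` threshold, and the `(t, y, x)` position layout are all assumptions made for illustration, and the warping step itself is taken as given (the sketch starts from already-warped tokens plus a per-pixel validity mask).

```python
import numpy as np


def patch_valid_fraction(pixel_mask: np.ndarray, patch: int) -> np.ndarray:
    """Pool a per-pixel validity mask (T, H, W) to per-patch fractions (T, H/patch, W/patch)."""
    T, H, W = pixel_mask.shape
    assert H % patch == 0 and W % patch == 0, "frame size must be divisible by patch size"
    blocks = pixel_mask.astype(np.float32).reshape(T, H // patch, patch, W // patch, patch)
    return blocks.mean(axis=(2, 4))


def build_pseudo_history(tokens, pixel_mask, target_frame_ids, patch, min_valid=0.5):
    """Hypothetical helper: keep warped tokens backed by valid source pixels.

    tokens:           (T, h, w, C) patch tokens of the camera-warped frames
    pixel_mask:       (T, H, W) True where the warp found a valid source pixel
    target_frame_ids: (T,) temporal index of the target frame each warped frame matches
    min_valid:        assumed threshold on the valid-pixel fraction of a patch

    Returns the kept tokens (N, C) and their (t, y, x) positions (N, 3), where t is
    the index of the target frame rather than a separate "past" time slot.
    """
    keep = patch_valid_fraction(pixel_mask, patch) >= min_valid   # (T, h, w) bool
    t_idx, y_idx, x_idx = np.nonzero(keep)                        # row-major order
    kept_tokens = tokens[t_idx, y_idx, x_idx]                     # (N, C)
    # Positional alignment: reuse the temporal index of the frame being denoised.
    positions = np.stack([np.asarray(target_frame_ids)[t_idx], y_idx, x_idx], axis=1)
    return kept_tokens, positions


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(2, 2, 2, 4))       # 2 warped frames, 2x2 patches, 4 channels
    mask = np.zeros((2, 4, 4), dtype=bool)
    mask[0] = True                               # frame 0: fully visible after warping
    mask[1, :2, :2] = True                       # frame 1: only the top-left patch is visible
    kept, pos = build_pseudo_history(tokens, mask, target_frame_ids=[5, 6], patch=2)
    print(kept.shape, pos.tolist())
```

In this sketch, patches that the warp left empty never enter the history sequence, and each surviving token carries the temporal position of the frame it is meant to guide. How the positions are turned into an actual positional encoding depends on the base model and is outside the scope of this example.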