Generative Video Motion Editing with 3D Point Tracks

Abstract

Camera and object motions are central to a video's narrative. However, precisely editing these captured motions remains a significant challenge, especially under complex object movements. Current motion-controlled image-to-video (I2V) approaches often lack full-scene context for consistent video editing, while video-to-video (V2V) methods provide viewpoint changes or basic object translation, but offer limited control over fine-grained object motion. We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a video generation model on a source video and paired 3D point tracks representing source and target motions. These 3D tracks establish sparse correspondences that transfer rich context from the source video to new motions while preserving spatiotemporal coherence. Crucially, compared to 2D tracks, 3D tracks provide explicit depth cues, allowing the model to resolve depth order and handle occlusions for precise motion editing. Trained in two stages on synthetic and real data, our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation, unlocking new creative potential in video editing.
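To illustrate why 3D tracks can resolve depth order where 2D tracks cannot, the following is a minimal sketch, not the model described above. It assumes a simple pinhole camera and hypothetical names (`project`, `visible_tracks`, the track IDs, and the focal length and principal point values are all illustrative). When two track points project to the same pixel, their 2D positions are identical, but the depth carried by the 3D track indicates which point is in front.

```python
from typing import Dict, Tuple

# (x, y, z) in camera coordinates; z is the depth along the viewing axis.
Point3D = Tuple[float, float, float]


def project(point: Point3D, focal: float, cx: float, cy: float) -> Tuple[int, int, float]:
    """Pinhole projection of one 3D point to an integer pixel, keeping its depth."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    u = round(focal * x / z + cx)
    v = round(focal * y / z + cy)
    return u, v, z


def visible_tracks(points: Dict[str, Point3D], focal: float = 100.0,
                   cx: float = 64.0, cy: float = 64.0) -> Dict[Tuple[int, int], str]:
    """For each pixel, keep the ID of the track point nearest to the camera."""
    nearest: Dict[Tuple[int, int], Tuple[float, str]] = {}
    for track_id, point in points.items():
        u, v, z = project(point, focal, cx, cy)
        # A smaller depth means the point is closer and occludes the other one.
        if (u, v) not in nearest or z < nearest[(u, v)][0]:
            nearest[(u, v)] = (z, track_id)
    return {pixel: track_id for pixel, (_, track_id) in nearest.items()}


# Hypothetical track points at a single frame: "front" and "back" land on the
# same pixel and differ only in depth; "side" lands on a different pixel.
frame = {
    "front": (0.0, 0.0, 2.0),
    "back": (0.0, 0.0, 5.0),
    "side": (1.0, 0.0, 2.0),
}
result = visible_tracks(frame)
print(result)
```

With only the 2D projections, "front" and "back" would be indistinguishable at that pixel; the depth value is what lets the sketch decide that "back" is occluded.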
