Go-with-the-Track: Video Compositing and Motion Control via Point Tracking

Abstract

Filmmaking demands precise motion control and reference image compositing, two capabilities that existing methods treat separately. Point-track-conditioned image-to-video models restrict content insertion to the first frame, while reference-to-video models lack fine-grained spatiotemporal control over how reference content integrates across frames. We present Go-with-the-Track, which unifies both capabilities by jointly conditioning on multiple reference images and reference-anchored point-tracks. These extend conventional point-tracks to explicitly establish correspondences between generated frames and reference images, enabling precise compositing and motion control throughout the video. To achieve this, we introduce spatially-aware point-track embeddings that encode the full sequence of point-track coordinates using a coordinate-wise MLP followed by temporal pooling. This representation captures the spatial characteristics of each point-track, so it serves as a unique identifier, and its embedding similarity correlates directly with spatial proximity, which enhances the model's ability to distinguish and associate point-tracks. We inject these point-track embeddings into a video diffusion transformer via a lightweight adapter, resolving the pixel-to-patch resolution mismatch while avoiding the substantial loss of motion detail inherent in naive point-track subsampling. A hybrid training strategy trains the model jointly on dynamic, static, and synthetic scene video datasets to boost motion controllability. Experiments demonstrate that Go-with-the-Track achieves superior motion and reference control in a single model and enables new capabilities: multi-reference conditioned video generation with point-track driven compositing, as well as camera control for both static and dynamic scenes. Project Page: https://eyeline-labs.github.io/Go-with-the-Track/
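
To make the embedding step concrete, the following is a minimal sketch of a coordinate-wise MLP followed by temporal pooling, written in plain Python. It is an illustration under stated assumptions, not the model described above: the names (init_mlp, mlp, embed_track), the layer sizes, the tanh activation, the random untrained weights, and the choice of mean pooling are all hypothetical, since the abstract does not specify them.

```python
import math
import random


def init_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Random weights for a two-layer MLP.

    Assumption: in a real model these weights would be learned;
    random values are used here only to make the sketch runnable.
    """
    rng = random.Random(seed)

    def matrix(rows, cols):
        scale = 1.0 / math.sqrt(cols)
        return [[rng.uniform(-scale, scale) for _ in range(cols)]
                for _ in range(rows)]

    return {
        "w1": matrix(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
        "w2": matrix(out_dim, hidden_dim), "b2": [0.0] * out_dim,
    }


def linear(weights, bias, x):
    """Dense layer: one dot product per output unit, plus bias."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def mlp(params, xy):
    """Apply the same MLP to a single (x, y) coordinate."""
    hidden = [math.tanh(v) for v in linear(params["w1"], params["b1"], xy)]
    return linear(params["w2"], params["b2"], hidden)


def embed_track(params, track):
    """Encode every frame's coordinate, then mean-pool over time.

    `track` is a sequence of (x, y) pairs, one per frame. The output
    size depends only on the MLP, not on the number of frames.
    Assumption: mean pooling stands in for the unspecified pooling.
    """
    features = [mlp(params, xy) for xy in track]
    count = len(features)
    return [sum(f[d] for f in features) / count
            for d in range(len(features[0]))]


# Hypothetical usage: two tracks of different lengths map to
# embeddings of the same size.
params = init_mlp(in_dim=2, hidden_dim=16, out_dim=8)
short_track = [(0.10 + 0.01 * t, 0.20) for t in range(6)]
long_track = [(0.60, 0.30 + 0.02 * t) for t in range(24)]
print(len(embed_track(params, short_track)))  # 8
print(len(embed_track(params, long_track)))   # 8
```

Because the same MLP is applied to every coordinate and the per-frame features are pooled, one track yields one fixed-size vector regardless of how many frames it spans. With untrained weights the sketch makes no claim about similarity tracking spatial proximity; that property is attributed above to the proposed embeddings, not to this example.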