MilliVid: Hierarchical Latent Representations for Long-Range Consistency in Video Generation

Abstract

Video generative models have become increasingly powerful, but long-range consistency remains challenging to achieve because even a few dozen frames require impractically long transformer sequence lengths. We show that this issue can be mitigated by generating video using coarse-to-fine rollout within a multi-scale token space. Our approach is simple: first, we pre-train an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution to only a handful of tokens per frame. The coarsest levels capture the most consequential information, such as scene layout and semantics, while finer levels add high-frequency appearance and texture. Then, we train a video diffusion model to generate these tokens using coarse-to-fine rollout. By carefully controlling the level of detail at which frames are generated and used as context during each rollout step, we can preserve long-range consistency in geometry and object permanence while spending less compute on the long-range consistency of less perceptually relevant details. We validate this approach using a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts than existing baselines.
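To make the compute argument concrete, the sketch below counts context tokens when older frames are kept only at coarser levels of the hierarchy. It is an illustration under stated assumptions, not the method as implemented in this work: the per-level token counts in TOKENS_PER_LEVEL and the age-based schedule in level_for_age are hypothetical and chosen only to show the arithmetic.

```python
# Illustrative sketch only: the level sizes and the schedule below are
# assumptions made for this example, not values taken from the paper.

# Hypothetical tokens per frame at each hierarchy level, coarsest first.
TOKENS_PER_LEVEL = [4, 16, 64, 256]


def level_for_age(age: int) -> int:
    """Pick a detail level for a context frame that is `age` frames old.

    Assumed schedule: frames younger than 4 keep the finest level, and
    detail drops by one level each time the age doubles from 4 onward.
    """
    finest = len(TOKENS_PER_LEVEL) - 1
    if age < 4:
        return finest
    drop = (age // 4).bit_length()  # ages 4-7 -> 1, 8-15 -> 2, 16-31 -> 3
    return max(0, finest - drop)


def context_length(num_frames: int, hierarchical: bool) -> int:
    """Total number of context tokens for `num_frames` past frames."""
    if not hierarchical:
        # Every context frame is kept at the finest level.
        return num_frames * TOKENS_PER_LEVEL[-1]
    return sum(TOKENS_PER_LEVEL[level_for_age(age)] for age in range(num_frames))


if __name__ == "__main__":
    print(context_length(64, hierarchical=False))  # 16384
    print(context_length(64, hierarchical=True))   # 1600
```

Under these assumed numbers, a 64-frame context shrinks from 16384 tokens to 1600, while every frame still contributes at least its coarsest tokens, which the abstract describes as carrying scene layout and semantics.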