PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Abstract

Autoregressive video diffusion models have made rapid progress, yet long-video generation remains bottlenecked by linear KV-cache growth, temporal repetition, and compounding errors. To address these challenges, we present PackForcing, a unified framework that manages the generation history efficiently through a three-partition KV-cache strategy. Specifically, we divide the historical context into three types: (1) sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) mid tokens, which are compressed spatiotemporally (a 32x token reduction) by a dual-branch network that fuses progressive 3D convolutions with low-resolution VAE re-encoding; and (3) recent tokens, which are kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-k context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that re-aligns the position gaps left by dropped tokens at negligible overhead. With this hierarchical context compression, PackForcing generates coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It keeps the KV cache bounded at just 4 GB and enables 24x temporal extrapolation (from 5 s to 120 s), operating effectively either zero-shot or when trained only on 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), showing that short-video supervision is sufficient for high-quality long-video synthesis. https://github.com/ShandaAI/PackForcing
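The three-partition idea can be illustrated with a toy bookkeeping sketch. This is not the PackForcing implementation: the `Block` type, the per-block `score`, the function names, and the choice to re-index kept blocks with contiguous positions are all assumptions made for illustration, and the real method compresses mid tokens with a learned dual-branch network rather than by dividing a token count. The sketch only shows why the kept context stays bounded as the history grows.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    frame_index: int    # original temporal index of this history block
    tokens: int         # number of KV tokens the block occupies
    score: float = 0.0  # hypothetical relevance score used for top-k selection


def partition_history(blocks: List[Block], n_sink: int, n_recent: int):
    """Split the history into sink (earliest), mid, and recent (latest) blocks."""
    sink = blocks[:n_sink]
    rest = blocks[n_sink:]
    split = max(0, len(rest) - n_recent)
    return sink, rest[:split], rest[split:]


def compress_mid(block: Block, ratio: int = 32) -> Block:
    """Stand-in for learned compression: only the token count is reduced."""
    return Block(block.frame_index, max(1, block.tokens // ratio), block.score)


def select_top_k(mid: List[Block], k: int) -> List[Block]:
    """Keep the k highest-scoring mid blocks, restored to temporal order."""
    chosen = sorted(mid, key=lambda b: b.score, reverse=True)[:k]
    return sorted(chosen, key=lambda b: b.frame_index)


def build_context(
    blocks: List[Block], n_sink: int, n_recent: int, k: int, ratio: int = 32
) -> Tuple[List[Block], List[int]]:
    """Assemble the bounded context and assign gap-free temporal positions."""
    sink, mid, recent = partition_history(blocks, n_sink, n_recent)
    mid_kept = select_top_k([compress_mid(b, ratio) for b in mid], k)
    kept = sink + mid_kept + recent
    # Assumption: dropped blocks leave gaps in frame_index, so the kept blocks
    # are given contiguous positions before rotary embeddings would be applied.
    positions = list(range(len(kept)))
    return kept, positions
```

Under these assumptions the number of kept tokens depends only on `n_sink`, `n_recent`, `k`, and `ratio`, not on the history length, provided the history holds at least `n_sink + k + n_recent` blocks. That is the sense in which the cache is bounded.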

