FlowLong: Inference-Time Long Video Generation via Manifold-Constrained Tweedie Matching

Abstract

Extending the generation horizon of video diffusion models to long sequences is a long-standing challenge. Existing training-free approaches fall into two categories. Extensions of bidirectional models are tightly coupled to specific architectures and suffer from quality degradation over long horizons. Autoregressive models accumulate drift errors due to exposure bias and tend to produce repetitive motion patterns. To address these issues, we propose a novel yet simple inference-time approach to long video generation that is architecture-agnostic and requires no additional training. Our method generates long videos with overlapping sliding windows: the clean samples predicted by adjacent windows are blended via Tweedie matching, which enforces both the manifold constraint and temporal consistency across the overlap regions. Stochastic early-phase sampling then synchronizes the per-window trajectories by injecting fresh noise after each Tweedie matching correction during the high-noise phase, after which sampling switches to a deterministic ODE solver to preserve fine-grained visual fidelity. Applied to various video generation models, our method generates videos several times longer than the native window length while outperforming both training-free and autoregressive baselines in temporal consistency and visual quality. It further extends to joint audio-video generation and text-to-3DGS without any fine-tuning.
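
As a rough illustration of the overlapping-window idea described above, the following is a minimal sketch and not the authors' implementation. It assumes each window has already produced a clean-sample (Tweedie) prediction, and it blends those predictions by plain per-frame averaging in the overlap regions. The abstract does not specify the blending weights, so the uniform average, the function names `window_starts` and `blend_clean_predictions`, and the array layout (frames along axis 0) are all assumptions made for illustration. The stochastic re-noising step and the ODE phase are not shown.

```python
import numpy as np


def window_starts(total_frames, window, overlap):
    """Start indices of sliding windows covering total_frames (hypothetical helper)."""
    stride = window - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than window")
    if total_frames < window:
        raise ValueError("total_frames must be at least one window long")
    starts = list(range(0, total_frames - window + 1, stride))
    if starts[-1] + window < total_frames:
        # Add a final window flush with the end so every frame is covered.
        starts.append(total_frames - window)
    return starts


def blend_clean_predictions(preds, starts, total_frames):
    """Average per-window clean-sample predictions frame by frame.

    preds[i] has shape (window, ...) and covers frames
    starts[i] .. starts[i] + window - 1. Uniform averaging is an assumed
    blending rule, chosen only to keep the sketch simple.
    """
    window = preds[0].shape[0]
    acc = np.zeros((total_frames,) + preds[0].shape[1:])
    count = np.zeros(total_frames)
    for x0, s in zip(preds, starts):
        acc[s:s + window] += x0
        count[s:s + window] += 1
    # Reshape the counts so they broadcast over the non-frame axes.
    return acc / count.reshape(-1, *([1] * (acc.ndim - 1)))


# Toy usage: 10 frames, windows of 4 frames overlapping by 2.
starts = window_starts(10, 4, 2)
preds = [np.full((4, 1), float(i)) for i in range(len(starts))]
blended = blend_clean_predictions(preds, starts, 10)
```

In this toy run, frames covered by two windows receive the mean of the two predictions, while frames covered by a single window are left unchanged. In a full sampler this blending would be applied at every denoising step, before the next noise level is reached.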