Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

Abstract

We present FramePack, a neural network structure for training next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frames so that the transformer context length stays fixed regardless of video length. As a result, video diffusion can process a large number of frames with a computation bottleneck similar to that of image diffusion. This also allows significantly larger training video batch sizes, comparable to those of image diffusion training. We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and that their visual quality may improve because next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.
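The claim that context length can stay fixed regardless of video length can be illustrated with a small arithmetic sketch. The abstract does not specify a compression schedule, so the code below assumes one for illustration only: each older frame keeps a fraction `1 / rate` of the tokens of the frame after it, which makes the total a bounded geometric series. The function name `packed_context_length`, the 1536 tokens per frame, and the rate of 2 are all hypothetical choices, not values taken from the abstract.

```python
def packed_context_length(num_frames: int,
                          tokens_per_frame: int = 1536,
                          rate: int = 2) -> int:
    """Total context tokens under an ASSUMED geometric compression schedule.

    The most recent frame (age 0) keeps all of its tokens; a frame of
    age k keeps tokens_per_frame // rate**k tokens. Integer division means
    very old frames contribute zero tokens, a simplification made here.
    """
    total = 0
    for age in range(num_frames):
        total += tokens_per_frame // (rate ** age)
    return total


if __name__ == "__main__":
    # The total stops growing once old frames are compressed to nothing.
    for n in (1, 4, 16, 1000):
        print(n, packed_context_length(n))
```

Under these assumptions the total never reaches `2 * tokens_per_frame` (3072 here), however many input frames are supplied, so per-step cost is bounded in the way the abstract describes. How the oldest frames are actually handled is a design decision this sketch does not attempt to reproduce.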
