SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Abstract

We introduce SANA-Video, a small diffusion model that efficiently generates videos at up to 720x1280 resolution and up to one minute in length. SANA-Video synthesizes high-resolution, high-quality, long videos with strong text-video alignment, at a speed fast enough for deployment on an RTX 5090 GPU. Two core designs make this efficient, effective, long video generation possible: (1) Linear DiT: we use linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design a block-wise autoregressive approach for long video generation that employs a constant-memory state derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, reducing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Despite this low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, reducing the time to generate a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
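
The constant-memory state described in point (2) can be illustrated with a small NumPy sketch. This is not the authors' implementation. It only shows the general property of linear attention that the abstract relies on: the sum of key-value outer products can be accumulated block by block, so the cached state has a fixed size no matter how many blocks have been processed. The ReLU-based feature map `phi`, the function names, the single-head 2-D shapes, and the choice to let tokens attend bidirectionally within their own block are all assumptions made for illustration.

```python
import numpy as np


def phi(x):
    # Assumed feature map: ReLU plus a small constant so the normalizer stays positive.
    return np.maximum(x, 0.0) + 1e-6


def block_linear_attention(q, k, v, block_size):
    """Block-causal linear attention with a fixed-size running state.

    q, k: (n, d) arrays; v: (n, d_v) array. Each block attends to itself and
    to all earlier blocks, but only the (d, d_v) state and the (d,) normalizer
    are carried across blocks.
    """
    n, d = q.shape
    d_v = v.shape[1]
    state = np.zeros((d, d_v))  # running sum of phi(k)^T v
    norm = np.zeros(d)          # running sum of phi(k)
    out = np.empty((n, d_v))
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        kb = phi(k[start:end])
        state += kb.T @ v[start:end]
        norm += kb.sum(axis=0)
        qb = phi(q[start:end])
        out[start:end] = (qb @ state) / (qb @ norm)[:, None]
    return out


def reference_attention(q, k, v, block_size):
    """Same result computed the expensive way, keeping every past key and value."""
    n = q.shape[0]
    out = np.empty((n, v.shape[1]))
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        scores = phi(q[start:end]) @ phi(k[:end]).T
        out[start:end] = (scores @ v[:end]) / scores.sum(axis=1, keepdims=True)
    return out


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(10, 4)) for _ in range(3))
# The fixed-size state reproduces the result of attending over the full history.
assert np.allclose(block_linear_attention(q, k, v, 3), reference_attention(q, k, v, 3))
```

In this sketch the memory carried between blocks is `d * d_v + d` numbers regardless of sequence length, whereas `reference_attention` has to keep all past keys and values, which is the cost a traditional KV cache incurs.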