Seedance 1.0: Exploring the Boundaries of Video Generation Models

Abstract

Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundation models still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precise and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient architecture design with a proposed training paradigm, which natively supports multi-shot generation and joint learning of both text-to-video and image-to-video tasks; (iii) carefully optimized post-training approaches leveraging fine-grained supervised fine-tuning and video-specific RLHF (reinforcement learning from human feedback) with multi-dimensional reward mechanisms for comprehensive performance improvements; (iv) excellent model acceleration achieving ~10x inference speedup through multi-stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds (NVIDIA-L20). Compared to state-of-the-art video generation models, Seedance 1.0 stands out with high-quality and fast video generation, offering superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, and native multi-shot narrative coherence with consistent subject representation.
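To put the reported speed figures in perspective, the short sketch below works through the arithmetic. Only the 41.4-second generation time, the 5-second clip length, and the "~10x" speedup come from the abstract. The function names are hypothetical, and the implied pre-acceleration time assumes that the speedup is exactly 10x and applies to this same 1080p, NVIDIA-L20 configuration, which the abstract does not state.

```python
def compute_seconds_per_video_second(generation_seconds: float, clip_seconds: float) -> float:
    """Seconds of compute spent for each second of generated video."""
    return generation_seconds / clip_seconds


def implied_unaccelerated_seconds(accelerated_seconds: float, speedup: float) -> float:
    """Generation time before acceleration, if the speedup applied uniformly (assumption)."""
    return accelerated_seconds * speedup


GENERATION_SECONDS = 41.4  # reported in the abstract (NVIDIA-L20)
CLIP_SECONDS = 5.0         # reported in the abstract
SPEEDUP = 10.0             # reported as "~10x"; treated as exactly 10 here (assumption)

ratio = compute_seconds_per_video_second(GENERATION_SECONDS, CLIP_SECONDS)
baseline = implied_unaccelerated_seconds(GENERATION_SECONDS, SPEEDUP)

print(f"{ratio:.2f} s of compute per second of video")
print(f"roughly {baseline:.0f} s implied without acceleration (assumption)")
```

Under these assumptions, generation costs a little over 8 seconds of compute per second of output video, and the same clip would take on the order of seven minutes without the acceleration stack. The second figure is an extrapolation, not a reported measurement.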
