Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

Abstract

Recent advances in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundation model engineered specifically for native, joint audio-video generation. Built on a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we apply careful post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that increases inference speed by more than 10×. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.
