Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

Abstract

Recent advances in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundation model designed specifically for native joint audio-video generation. Built on a dual-branch Diffusion Transformer architecture, the model combines a cross-modal joint module with a specialized multi-stage data pipeline to achieve strong audio-visual synchronization and high generation quality. To ensure practical utility, we apply careful post-training optimization, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. We also introduce an acceleration framework that increases inference speed by more than 10×. Seedance 1.5 pro offers precise lip-syncing across multiple languages and dialects, dynamic cinematic camera control, and enhanced narrative coherence, making it a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.
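
To make the idea of a cross-modal joint module between two branches more concrete, the following is a minimal illustrative sketch, not the architecture used in Seedance 1.5 pro. It assumes that each branch holds a sequence of token vectors and that the joint step is a plain bidirectional cross-attention with a residual connection. The function names (`cross_attend`, `joint_module`) are hypothetical, and the learned projections, normalization, and diffusion timestep conditioning that a real Diffusion Transformer block would include are omitted.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, context):
    """Each query token takes a softmax-weighted average of the context tokens.

    Assumption: queries and context share the same feature dimension and no
    learned query/key/value projections are applied (identity projections).
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in context])
        mixed = [sum(w * k[i] for w, k in zip(weights, context)) for i in range(dim)]
        out.append(mixed)
    return out

def joint_module(video_tokens, audio_tokens):
    """Hypothetical joint step: each branch reads from the other, plus a residual."""
    video_ctx = cross_attend(video_tokens, audio_tokens)
    audio_ctx = cross_attend(audio_tokens, video_tokens)
    new_video = [[v + c for v, c in zip(tok, ctx)]
                 for tok, ctx in zip(video_tokens, video_ctx)]
    new_audio = [[a + c for a, c in zip(tok, ctx)]
                 for tok, ctx in zip(audio_tokens, audio_ctx)]
    return new_video, new_audio

# Toy inputs: 3 video tokens and 2 audio tokens, feature dimension 2.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, -1.0]]
new_video, new_audio = joint_module(video, audio)
print(len(new_video), len(new_audio))
```

In this sketch the two sequences keep their own lengths, so the video and audio branches can run at different token rates while still exchanging information at every joint step.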