
UniVerse-1: Unified Audio-Video Generation via Stitching of Experts

September 7, 2025
作者: Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili Dai, Daxin Jiang, Gang Yu
cs.AI

Abstract

We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To enhance training efficiency, we bypass training from scratch and instead employ a stitching-of-experts (SoE) technique. This approach deeply fuses the corresponding blocks of pre-trained video and music generation expert models, thereby fully leveraging their foundational capabilities. To ensure accurate annotations and temporal alignment of both ambient sounds and speech with the video content, we developed an online annotation pipeline that processes the required training data and generates labels during the training process. This strategy circumvents the performance degradation often caused by misaligned text-based annotations. Through the synergy of these techniques, our model, after being fine-tuned on approximately 7,600 hours of audio-video data, produces results with well-coordinated audio and visuals for ambient sound generation and strong alignment for speech generation. To systematically evaluate our proposed method, we introduce Verse-Bench, a new benchmark dataset. In an effort to advance research in audio-video generation and to close the performance gap with state-of-the-art models such as Veo-3, we make our model and code publicly available. We hope this contribution will benefit the broader research community. Project page: https://dorniwang.github.io/UniVerse-1/.
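As an illustration of the stitching-of-experts idea described in the abstract, the minimal sketch below pairs corresponding blocks of a pretrained video expert and a pretrained music/audio expert and lets their hidden-state streams exchange information through small newly initialized bridges. The abstract does not specify the fusion operator or module layout, so the class names (`StitchedBlock`, `StitchedExperts`), the linear bridges, and the pooled residual exchange are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class StitchedBlock(nn.Module):
    """Pairs one video-expert block with one audio-expert block and lets the
    two streams exchange information through learned bridges (hypothetical
    fusion operator; the paper's actual block design may differ)."""

    def __init__(self, video_block: nn.Module, audio_block: nn.Module, dim: int):
        super().__init__()
        self.video_block = video_block                 # pretrained block from the video expert
        self.audio_block = audio_block                 # pretrained block from the music/audio expert
        self.audio_to_video = nn.Linear(dim, dim)      # newly initialized bridge
        self.video_to_audio = nn.Linear(dim, dim)      # newly initialized bridge

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        v = self.video_block(v)
        a = self.audio_block(a)
        # Residual cross-modal exchange via a pooled summary of the other
        # stream, so audio and video token counts need not match.
        v_out = v + self.audio_to_video(a.mean(dim=1, keepdim=True))
        a_out = a + self.video_to_audio(v.mean(dim=1, keepdim=True))
        return v_out, a_out


class StitchedExperts(nn.Module):
    """Stitches the two experts depth-wise: block i of the video model is
    paired with block i of the audio model."""

    def __init__(self, video_blocks, audio_blocks, dim: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            StitchedBlock(vb, ab, dim) for vb, ab in zip(video_blocks, audio_blocks)
        )

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        for blk in self.blocks:
            v, a = blk(v, a)
        return v, a


# Toy usage with stand-in expert blocks; real experts would be loaded from
# pretrained video and music generation models.
if __name__ == "__main__":
    dim, depth = 64, 4
    video_blocks = [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)]
    audio_blocks = [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)]
    model = StitchedExperts(video_blocks, audio_blocks, dim)
    v = torch.randn(2, 16, dim)   # (batch, video tokens, dim)
    a = torch.randn(2, 32, dim)   # (batch, audio tokens, dim)
    v_out, a_out = model(v, a)
    print(v_out.shape, a_out.shape)
```

The key point the sketch tries to capture is that the pretrained blocks keep their weights while only the thin bridges are new, which is consistent with the abstract's claim that SoE leverages the experts' foundational capabilities instead of training from scratch.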