

UniVerse-1: Unified Audio-Video Generation via Stitching of Experts

September 7, 2025
Authors: Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili Dai, Daxin Jiang, Gang Yu
cs.AI

Abstract

We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To improve training efficiency, we avoid training from scratch and instead employ a stitching of experts (SoE) technique. This approach deeply fuses the corresponding blocks of pre-trained video and music generation expert models, thereby fully leveraging their foundational capabilities. To ensure accurate annotations and temporal alignment of both ambient sounds and speech with the video content, we developed an online annotation pipeline that processes the required training data and generates labels during the training process. This strategy circumvents the performance degradation often caused by misaligned text-based annotations. Through the synergy of these techniques, our model, after being fine-tuned on approximately 7,600 hours of audio-video data, produces well-coordinated audio and visuals for ambient sound generation and strong alignment for speech generation. To systematically evaluate our proposed method, we introduce Verse-Bench, a new benchmark dataset. To advance research in audio-video generation and to close the performance gap with state-of-the-art models such as Veo-3, we make our model and code publicly available. We hope this contribution will benefit the broader research community. Project page: https://dorniwang.github.io/UniVerse-1/.
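
As a rough illustration of what fusing "corresponding blocks" of two pre-trained experts could look like, the toy sketch below runs a video stack and an audio stack in lockstep and lets each stream see a summary of the other after every block pair. The abstract does not specify the fusion operator, so the mean-exchange step, the `mix` weight, and all function names here are assumptions made for illustration only, not the authors' implementation.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Block = Callable[[Vector], Vector]


def make_scale_block(factor: float) -> Block:
    """Hypothetical stand-in for one pre-trained expert block."""
    return lambda x: [factor * v for v in x]


def mean(x: Vector) -> float:
    return sum(x) / len(x)


def stitched_forward(
    video_blocks: List[Block],
    audio_blocks: List[Block],
    video: Vector,
    audio: Vector,
    mix: float = 0.1,
) -> Tuple[Vector, Vector]:
    """Run paired blocks in lockstep, exchanging a summary after each pair."""
    if len(video_blocks) != len(audio_blocks):
        raise ValueError("this sketch assumes a one-to-one block pairing")
    for v_block, a_block in zip(video_blocks, audio_blocks):
        # Each expert block processes its own modality first.
        video, audio = v_block(video), a_block(audio)
        # Assumed fusion step: add the other stream's mean activation.
        v_ctx, a_ctx = mean(video), mean(audio)
        video = [v + mix * a_ctx for v in video]
        audio = [a + mix * v_ctx for a in audio]
    return video, audio


video_out, audio_out = stitched_forward(
    [make_scale_block(1.0)], [make_scale_block(1.0)], [1.0, 3.0], [2.0], mix=0.5
)
```

With `mix=0.0` the two stacks run independently and reproduce each expert's original output, which is one way to picture how stitching can start from the pre-trained capabilities and then learn the cross-modal coupling during fine-tuning.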