

MOVA: Towards Scalable and Synchronized Video-Audio Generation

February 9, 2026
Authors: SII-OpenMOSS Team, Donghua Yu, Mingshu Chen, Qi Chen, Qi Luo, Qianyi Wu, Qinyuan Cheng, Ruixiao Li, Tianyi Liang, Wenbo Zhang, Wenming Tu, Xiangyu Peng, Yang Gao, Yanru Huo, Ying Zhu, Yinze Luo, Yiyang Zhang, Yuerong Song, Zhe Xu, Zhiyu Zhang, Chenchen Yang, Cheng Chang, Chushu Zhou, Hanfu Chen, Hongnan Ma, Jiaxi Li, Jingqi Tong, Junxi Liu, Ke Chen, Shimin Li, Songlin Wang, Wei Jiang, Zhaoye Fei, Zhiyuan Ning, Chunguo Li, Chenhui Li, Ziwei He, Zengfeng Huang, Xie Chen, Xipeng Qiu
cs.AI

Abstract

Audio is indispensable for real-world video, yet generative models have largely overlooked the audio component. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 highlight the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture with 32B total parameters, of which 18B are active during inference, and supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase provides comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.
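The 32B-total / 18B-active split reflects how sparse expert routing works in general: every expert contributes to the total parameter count, but each token is processed by only its top-k routed experts at inference. The sketch below is a minimal, generic top-k MoE layer for illustration only, assuming a PyTorch-style feed-forward block; it is not MOVA's actual implementation, and all module names and sizes here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Illustrative top-k Mixture-of-Experts feed-forward layer.

    All experts count toward *total* parameters, but only the top_k
    experts selected by the router run per token, so the *active*
    parameter count at inference is a fraction of the total.
    (Generic sketch; not MOVA's architecture.)
    """

    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # token -> expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Route each token to its top_k experts.
        weights = F.softmax(self.router(x), dim=-1)          # (tokens, num_experts)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)  # (tokens, top_k)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)   # renormalize gates
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Find which tokens routed to expert e, and in which top-k slot.
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # this expert is inactive for the batch
            out[token_ids] += topk_w[token_ids, slot, None] * expert(x[token_ids])
        return out

layer = TopKMoE(dim=1024, num_experts=8, top_k=2)
total_params = sum(p.numel() for p in layer.parameters())
# With top_k=2 of 8 experts, roughly a quarter of the expert
# parameters (plus the router) are active for any given token.
```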