

YuE: Scaling Open Foundation Models for Long-Form Music Generation

March 11, 2025
Authors: Ruibin Yuan, Hanfeng Lin, Shuyue Guo, Ge Zhang, Jiahao Pan, Yongyi Zang, Haohe Liu, Yiming Liang, Wenye Ma, Xingjian Du, Xinrun Du, Zhen Ye, Tianyu Zheng, Yinghao Ma, Minghao Liu, Zeyue Tian, Ziya Zhou, Liumeng Xue, Xingwei Qu, Yizhi Li, Shangda Wu, Tianhao Shen, Ziyang Ma, Jun Zhan, Chunhui Wang, Yatian Wang, Xiaowei Chi, Xinyue Zhang, Zhenzhu Yang, Xiangzhou Wang, Shansong Liu, Lingrui Mei, Peng Li, Junjie Wang, Jianwei Yu, Guojian Pang, Xu Li, Zihao Wang, Xiaohuan Zhou, Lijun Yu, Emmanouil Benetos, Yong Chen, Chenghua Lin, Xie Chen, Gus Xia, Zhaoxiang Zhang, Chao Zhang, Wenhu Chen, Xinyu Zhou, Xipeng Qiu, Roger Dannenberg, Jiaheng Liu, Jian Yang, Wenhao Huang, Wei Xue, Xu Tan, Yike Guo
cs.AI

Abstract

We tackle the task of long-form music generation--particularly the challenging lyrics-to-song problem--by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe to converge and generalize. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some of the proprietary systems in musicality and vocal agility. In addition, fine-tuning YuE enables additional controls and enhanced support for tail languages. Furthermore, beyond generation, we show that YuE's learned representations can perform well on music understanding tasks, where the results of YuE match or exceed state-of-the-art methods on the MARBLE benchmark. Keywords: lyrics2song, song generation, long-form, foundation model, music generation
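
The core modeling idea named in the abstract, track-decoupled next-token prediction, can be pictured as tokenizing the vocal and accompaniment stems separately and interleaving them into a single sequence for an autoregressive LM, rather than predicting tokens of the dense mixed signal directly. The sketch below illustrates that interleaving in Python; the function names, codebook size, and token-offset scheme are illustrative assumptions, not YuE's actual implementation.

```python
# Minimal sketch of track-decoupled next-token prediction as described in the
# abstract: vocal and accompaniment tracks are tokenized separately and
# interleaved frame by frame, so one autoregressive LM predicts both tracks
# instead of a single dense mixture stream.
# All names and constants here are hypothetical, not YuE's API.

from typing import List, Tuple

VOCAB_PER_TRACK = 1024            # assumed codec codebook size per track
VOCAL_OFFSET = 0                  # vocal token ids occupy [0, 1024)
ACCOMP_OFFSET = VOCAB_PER_TRACK   # accompaniment ids occupy [1024, 2048)


def interleave_tracks(vocal_tokens: List[int], accomp_tokens: List[int]) -> List[int]:
    """Interleave frame-aligned vocal and accompaniment codec tokens into one
    sequence for next-token prediction: v_0, a_0, v_1, a_1, ..."""
    assert len(vocal_tokens) == len(accomp_tokens), "tracks must be frame-aligned"
    seq: List[int] = []
    for v, a in zip(vocal_tokens, accomp_tokens):
        seq.append(VOCAL_OFFSET + v)
        seq.append(ACCOMP_OFFSET + a)
    return seq


def split_tracks(seq: List[int]) -> Tuple[List[int], List[int]]:
    """Invert the interleaving to recover both track streams for codec decoding."""
    vocal = [t - VOCAL_OFFSET for t in seq[0::2]]
    accomp = [t - ACCOMP_OFFSET for t in seq[1::2]]
    return vocal, accomp


if __name__ == "__main__":
    # Toy frame-aligned token streams from a hypothetical audio codec.
    vocal = [12, 7, 903]
    accomp = [44, 44, 17]
    seq = interleave_tracks(vocal, accomp)
    print(seq)                # [12, 1068, 7, 1068, 903, 1041]
    print(split_tracks(seq))  # ([12, 7, 903], [44, 44, 17])
```

Keeping the two tracks in disjoint id ranges lets a standard next-token objective learn vocal/accompaniment interaction while each stream remains separately decodable, which is one plausible reading of how decoupling avoids the dense-mixture problem the abstract mentions.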

