HeartMuLa: A Family of Open Sourced Music Foundation Models

January 15, 2026
Authors: Dongchao Yang, Yuxin Xie, Yuguo Yin, Zheyu Wang, Xiaoyu Yi, Gongxi Zhu, Xiaolong Weng, Zihan Xiong, Yingzhe Ma, Dading Cong, Jingliang Liu, Zihang Huang, Jinghan Ru, Rongjie Huang, Haoran Wan, Peixu Wang, Kuoxi Yu, Helin Wang, Liming Liang, Xianwei Zhuang, Yuanyuan Wang, Haohan Guo, Junjie Cao, Zeqian Ju, Songxiang Liu, Yuewen Cao, Heming Weng, Yuexian Zou
cs.AI

Abstract

We present a family of open-source Music Foundation Models designed to advance large-scale music understanding and generation across diverse tasks and modalities. Our framework consists of four major components: (1) HeartCLAP, an audio-text alignment model; (2) HeartTranscriptor, a robust lyric recognition model optimized for real-world music scenarios; (3) HeartCodec, a low-frame-rate (12.5 Hz) yet high-fidelity music codec tokenizer that captures long-range musical structure while preserving fine-grained acoustic details and enabling efficient autoregressive modeling; and (4) HeartMuLa, an LLM-based song generation model capable of synthesizing high-fidelity music under rich, user-controllable conditions (e.g., textual style descriptions, lyrics, and reference audio). In addition, HeartMuLa provides two specialized modes: (i) fine-grained musical attribute control, which allows users to specify the style of different song sections (e.g., intro, verse, chorus) using natural-language prompts; and (ii) short, engaging music generation suitable as background music for short videos. Notably, HeartMuLa's performance improves significantly when scaled to 7B parameters. For the first time, we show that a Suno-level, commercial-grade system can be reproduced using academic-scale data and GPU resources. We expect these foundation models to serve as strong baselines for future research and to facilitate practical applications in multimodal content production.
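To make the efficiency argument behind HeartCodec's 12.5 Hz token rate concrete, the short Python sketch below estimates the token budget an autoregressive model must produce for a typical song, and shows a hypothetical section-level style prompt in the spirit of mode (i). The function name, the single-stream assumption, the 50 Hz comparison point, and the prompt layout are illustrative assumptions for this sketch, not the released HeartMuLa interface.

```python
# Back-of-the-envelope sketch: why a 12.5 Hz codec rate matters for
# autoregressive song generation. Purely illustrative; this is NOT the
# released HeartCodec/HeartMuLa API.

def codec_tokens_per_song(duration_s: float, frame_rate_hz: float,
                          n_codebooks: int = 1) -> int:
    """Number of discrete codec tokens an LLM must generate for one song."""
    return int(duration_s * frame_rate_hz) * n_codebooks

# A typical 3-minute song at HeartCodec's stated 12.5 Hz frame rate:
print(codec_tokens_per_song(180, 12.5))   # 2250 tokens per codebook
# The same song with a hypothetical 50 Hz acoustic tokenizer, for contrast:
print(codec_tokens_per_song(180, 50.0))   # 9000 tokens per codebook

# Hypothetical section-level conditioning text in the spirit of the
# fine-grained attribute control mode described in the abstract:
prompt = (
    "[intro] ambient piano, slow build\n"
    "[verse] acoustic guitar, intimate vocal\n"
    "[chorus] full band, anthemic, bright synths\n"
)
```

Under these assumptions, a three-minute song fits in roughly 2,250 tokens per codebook at 12.5 Hz, about a quarter of what a 50 Hz tokenizer would require, which is what keeps long-range autoregressive modeling of full songs tractable.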