SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

Abstract

Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, which make training and inference pipelines cumbersome. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The model offers fine-grained control over diverse musical attributes through lyrics and textual descriptions of instrumentation, genre, mood, and timbre, and it optionally accepts a three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which directly generates a mixture of vocals and accompaniment, and dual-track mode, which synthesizes vocals and accompaniment separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. Generated samples are showcased on our project page at https://liuzh-19.github.io/SongGen/, and the code will be available at https://github.com/LiuZH-19/SongGen.
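The abstract does not spell out what the token patterns look like. As a purely illustrative sketch, and not a description of SongGen's actual implementation, one way a single auto-regressive decoder could emit two tracks is to interleave vocal and accompaniment tokens frame by frame in one sequence, then split them apart after decoding. The function names, the integer token IDs, and the one-token-per-frame simplification below are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def interleave_tracks(vocal: Sequence[int], accomp: Sequence[int]) -> List[int]:
    """Merge two equal-length token tracks frame by frame: v0, a0, v1, a1, ...

    Hypothetical helper: assumes one token per frame per track.
    """
    if len(vocal) != len(accomp):
        raise ValueError("tracks must have the same number of frames")
    merged: List[int] = []
    for v, a in zip(vocal, accomp):
        merged.extend((v, a))
    return merged


def split_tracks(merged: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Invert interleave_tracks: even positions are vocal, odd are accompaniment."""
    if len(merged) % 2 != 0:
        raise ValueError("interleaved sequence must have an even length")
    return list(merged[0::2]), list(merged[1::2])


# Toy token IDs, invented for illustration only.
vocal_tokens = [11, 12, 13]
accomp_tokens = [21, 22, 23]

sequence = interleave_tracks(vocal_tokens, accomp_tokens)
print(sequence)  # [11, 21, 12, 22, 13, 23]

recovered_vocal, recovered_accomp = split_tracks(sequence)
print(recovered_vocal, recovered_accomp)
```

With a layout like this, the decoder predicts each frame's tokens with both tracks' earlier frames in context, which is one plausible reason the choice of pattern could affect how well vocals and accompaniment stay aligned.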
