SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

Abstract

Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model provides fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, and optionally accepts a three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. Generated samples are showcased on our project page (https://liuzh-19.github.io/SongGen/), and the code will be available at https://github.com/LiuZH-19/SongGen
