Stack-and-Delay: a new codebook pattern for music generation

Abstract

In language-modeling-based music generation, a generated waveform is represented by a sequence of hierarchical token stacks that can be decoded either auto-regressively or in parallel, depending on the codebook pattern. In particular, flattening the codebooks is the highest-quality decoding strategy, but it is notoriously slow. To address this, we propose a novel stack-and-delay decoding strategy that improves upon flat pattern decoding, generating four times faster than vanilla flat decoding. This brings the inference time close to that of the delay decoding strategy and allows for faster inference on GPU for small batch sizes. For the same inference efficiency budget as the delay pattern, we show that the proposed approach performs better in objective evaluations, almost closing the quality gap with the flat pattern. These results are corroborated by subjective evaluations, which show that, given the same text prompts, samples generated by the new model are preferred slightly more often than samples generated by the competing model.
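To make the speed trade-off between the two baseline patterns concrete, the sketch below builds a decoding schedule for a flat pattern and for a delay pattern, as these are commonly described for multi-codebook audio token models. It does not implement the proposed stack-and-delay pattern, whose layout is not specified in the abstract. The function names, the `(frame, codebook)` tuple layout, and the values of `T` and `K` are assumptions chosen for illustration.

```python
def flat_pattern(num_frames, num_codebooks):
    """Flat pattern: one token per decoding step, ordered frame by frame."""
    return [[(t, k)] for t in range(num_frames) for k in range(num_codebooks)]


def delay_pattern(num_frames, num_codebooks):
    """Delay pattern: codebook k is shifted by k steps, so each decoding
    step predicts up to num_codebooks tokens in parallel."""
    schedule = []
    for s in range(num_frames + num_codebooks - 1):
        # At step s, codebook k emits the token for frame s - k (if it exists).
        step = [(s - k, k) for k in range(num_codebooks) if 0 <= s - k < num_frames]
        schedule.append(step)
    return schedule


T, K = 6, 4  # hypothetical sizes: 6 frames, 4 codebooks
flat = flat_pattern(T, K)
delay = delay_pattern(T, K)
print(len(flat), len(delay))  # 24 decoding steps vs 9
```

Under these assumptions the flat schedule needs `T * K` auto-regressive steps while the delay schedule needs `T + K - 1`, so for long sequences the flat pattern takes roughly `K` times as many steps.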
