Stack-and-Delay: a new codebook pattern for music generation
September 15, 2023
作者: Gael Le Lan, Varun Nagaraja, Ernie Chang, David Kant, Zhaoheng Ni, Yangyang Shi, Forrest Iandola, Vikas Chandra
cs.AI
Abstract
In language-model-based music generation, a generated waveform is
represented by a sequence of hierarchical token stacks that can be decoded
either auto-regressively or in parallel, depending on the codebook pattern.
In particular, flattening the codebooks yields the highest-quality decoding
strategy, but it is notoriously slow. To address this, we propose a novel
stack-and-delay decoding strategy that improves on flat-pattern decoding,
generating four times faster than vanilla flat decoding. This brings
inference time close to that of the delay decoding strategy and enables
faster GPU inference at small batch sizes. For the same inference-efficiency
budget as the delay pattern, we show that the proposed approach performs
better in objective evaluations, nearly closing the quality gap with the
flat pattern. Subjective evaluations corroborate these results: given the
same text prompts, samples generated by the new model are slightly more
often preferred to those generated by the competing model.
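To make the speed trade-off concrete, here is a minimal sketch (not the paper's implementation) of how the baseline "flat" and "delay" codebook patterns schedule K hierarchical token streams over T timesteps. The function names and the padding convention are illustrative assumptions; the stack-and-delay pattern itself is not reproduced here, since its exact layout is not specified in the abstract.

```python
# Illustrative sketch of two standard codebook decoding patterns.
# Each entry is a (timestep, codebook) pair; None marks padding.

def flat_pattern(T, K):
    """Flat: fully sequential -- one decoding step per token, so K*T steps.
    Highest quality, but the slowest schedule."""
    return [[(t, k)] for t in range(T) for k in range(K)]

def delay_pattern(T, K):
    """Delay: codebook k is shifted by k steps; all K tokens at a decoding
    step are predicted in parallel, giving only T + K - 1 steps."""
    steps = []
    for s in range(T + K - 1):
        step = []
        for k in range(K):
            t = s - k
            step.append((t, k) if 0 <= t < T else None)
        steps.append(step)
    return steps

T, K = 4, 3
print(len(flat_pattern(T, K)))   # 12 decoding steps (K * T)
print(len(delay_pattern(T, K)))  # 6 decoding steps (T + K - 1)
```

With K = 4 codebooks, the flat schedule needs roughly 4x as many decoding steps as the delay schedule, which matches the speed gap the abstract describes between vanilla flat decoding and the faster alternatives.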