Stack-and-Delay: a new codebook pattern for music generation
September 15, 2023
Authors: Gael Le Lan, Varun Nagaraja, Ernie Chang, David Kant, Zhaoheng Ni, Yangyang Shi, Forrest Iandola, Vikas Chandra
cs.AI
Abstract
In language-modeling-based music generation, a generated waveform is
represented by a sequence of hierarchical token stacks that can be decoded
either in an auto-regressive manner or in parallel, depending on the codebook
patterns. In particular, flattening the codebooks is the highest-quality
decoding strategy, but it is notoriously slow. To address this, we propose a
novel stack-and-delay decoding strategy that improves upon flat pattern
decoding, making generation four times faster than vanilla flat decoding.
This brings the inference time close to that of the
delay decoding strategy, and allows for faster inference on GPU for small batch
sizes. For the same inference efficiency budget as the delay pattern, we show
that the proposed approach performs better in objective evaluations, almost
closing the gap with the flat pattern in terms of quality. The results are
corroborated by subjective evaluations which show that samples generated by the
new model are slightly more often preferred to samples generated by the
competing model given the same text prompts.
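
To make the trade-off between the two baseline patterns concrete, the following is a minimal sketch of how flat and delay decoding schedules differ in the number of sequential steps. It follows the usual description of these two patterns in prior language-model-based music generation work; the function names, the (codebook, frame) token layout, and the choice of 4 codebooks are assumptions for illustration and are not taken from this abstract. The stack-and-delay pattern itself is not sketched, since the abstract does not specify its layout.

```python
from typing import List, Tuple

# A token is identified by (codebook index k, frame index t).
# Both helper names below are hypothetical, for illustration only.
Token = Tuple[int, int]


def flat_pattern(num_codebooks: int, num_frames: int) -> List[List[Token]]:
    """Flat schedule: one token per decoding step.

    All codebooks of frame 0 are decoded one after another, then frame 1,
    and so on, giving num_codebooks * num_frames sequential steps.
    """
    return [[(k, t)] for t in range(num_frames) for k in range(num_codebooks)]


def delay_pattern(num_codebooks: int, num_frames: int) -> List[List[Token]]:
    """Delay schedule: codebook k is shifted by k steps.

    Each step predicts up to num_codebooks tokens in parallel, giving
    num_frames + num_codebooks - 1 sequential steps.
    """
    steps: List[List[Token]] = []
    for s in range(num_frames + num_codebooks - 1):
        # At step s, codebook k emits frame s - k if that frame exists.
        step = [(k, s - k) for k in range(num_codebooks) if 0 <= s - k < num_frames]
        steps.append(step)
    return steps


if __name__ == "__main__":
    K, T = 4, 100  # assumed values: 4 codebooks, 100 frames
    print("flat steps: ", len(flat_pattern(K, T)))
    print("delay steps:", len(delay_pattern(K, T)))
```

Under these assumptions, the flat schedule needs K times as many sequential steps as there are frames, while the delay schedule needs only about one step per frame, which is the gap in inference time that the abstract refers to.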