Scaling Transformers for Low-Bitrate High-Quality Speech Coding

Abstract

The tokenization of speech with neural audio codec models is a vital part of modern AI pipelines for the generation or understanding of speech, alone or in a multimodal context. Traditionally, such tokenization models have concentrated on low parameter-count architectures using only components with strong inductive biases. In this work we show that by scaling a transformer architecture with large parameter count to this problem, and applying a flexible Finite Scalar Quantization (FSQ) based bottleneck, it is possible to reach state-of-the-art speech quality at extremely low bitrates of 400 or 700 bits per second. The trained models strongly outperform existing baselines in both objective and subjective tests.

