FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks

Abstract

Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by this success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations, including high bitrates, the loss of either semantic or acoustic information, and, when trying to capture both, reliance on multi-codebook designs that increase architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that uses a single binary codebook to compress speech to between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state of the art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec preserves sufficient semantic and acoustic information while also being well suited to generative modeling. Demo samples, code, and checkpoints are available at https://lucadellalib.github.io/focalcodec-web/.
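
To make the stated bitrate range concrete, the sketch below works through the arithmetic for a single-codebook codec: the bitrate is the number of tokens per second multiplied by the number of bits per token. The abstract gives only the resulting range (0.16 to 0.65 kbps). The 13-bit code width and the token rates of 12.5, 25, and 50 Hz are assumptions chosen here because they reproduce that range; they are not taken from the text above, and the function names are purely illustrative.

```python
def bitrate_kbps(token_rate_hz: float, bits_per_token: int) -> float:
    """Bitrate of a single-codebook codec: tokens per second times bits per token."""
    return token_rate_hz * bits_per_token / 1000.0


def codebook_size(bits_per_token: int) -> int:
    """A binary codebook with b bits per token can address 2**b distinct codes."""
    return 2 ** bits_per_token


# ASSUMPTION: these values are not stated in the abstract. They are picked
# only because they are consistent with the quoted 0.16 to 0.65 kbps range.
BITS_PER_TOKEN = 13
TOKEN_RATES_HZ = (12.5, 25.0, 50.0)

if __name__ == "__main__":
    print(f"codebook size: {codebook_size(BITS_PER_TOKEN)} codes")
    for rate in TOKEN_RATES_HZ:
        kbps = bitrate_kbps(rate, BITS_PER_TOKEN)
        print(f"{rate:5.1f} tokens/s x {BITS_PER_TOKEN} bits = {kbps:.4f} kbps")
```

Under these assumed values, the lowest and highest token rates give 0.1625 and 0.65 kbps, which round to the endpoints quoted in the abstract. With a single codebook, each frame is one token, so a downstream model sees a plain token sequence rather than several parallel streams.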