FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation

Abstract

Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reconstruction and provide powerful representations for downstream tasks, most are not streamable, which limits their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55 to 0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency. Code and checkpoints will be released at https://github.com/lucadellalib/focalcodec.
