FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation
September 19, 2025
Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli
cs.AI
Abstract
Neural audio codecs are a fundamental component of modern generative audio
pipelines. Although recent codecs achieve strong low-bitrate reconstruction and
provide powerful representations for downstream tasks, most are non-streamable,
limiting their use in real-time applications. We present FocalCodec-Stream, a
hybrid codec based on focal modulation that compresses speech into a single
binary codebook at 0.55 to 0.80 kbps with a theoretical latency of 80 ms. Our
approach combines multi-stage causal distillation of WavLM with targeted
architectural improvements, including a lightweight refiner module that
enhances quality under latency constraints. Experiments show that
FocalCodec-Stream outperforms existing streamable codecs at comparable
bitrates, while preserving both semantic and acoustic information. The result
is a favorable trade-off between reconstruction quality, downstream task
performance, latency, and efficiency. Code and checkpoints will be released at
https://github.com/lucadellalib/focalcodec.
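
The bitrate and latency figures quoted above can be related to token rate and codebook size with simple arithmetic. The sketch below is illustrative only: the abstract does not state the token rate, the number of bits per code, or the sample rate, so the values used here (50 tokens per second, 11 and 16 bits per token, 16 kHz audio) are assumptions chosen because they reproduce the stated 0.55 and 0.80 kbps endpoints. The function names are hypothetical and are not part of the FocalCodec code base.

```python
def bitrate_kbps(tokens_per_second: float, bits_per_token: int) -> float:
    """Bitrate of a single-codebook codec that emits one token per frame."""
    return tokens_per_second * bits_per_token / 1000.0


def binary_codebook_size(bits_per_token: int) -> int:
    """Number of distinct codes when each token is a vector of binary values."""
    return 2 ** bits_per_token


def latency_samples(latency_ms: float, sample_rate_hz: int) -> int:
    """Convert a latency in milliseconds to a number of audio samples."""
    return round(latency_ms * sample_rate_hz / 1000.0)


def latency_frames(latency_ms: float, tokens_per_second: float) -> float:
    """Convert a latency in milliseconds to a number of token frames."""
    return latency_ms * tokens_per_second / 1000.0


# Assumed values, NOT taken from the abstract.
ASSUMED_TOKEN_RATE_HZ = 50.0
ASSUMED_SAMPLE_RATE_HZ = 16_000
ASSUMED_BITS_PER_TOKEN = (11, 16)

# Stated in the abstract.
THEORETICAL_LATENCY_MS = 80.0

if __name__ == "__main__":
    for bits in ASSUMED_BITS_PER_TOKEN:
        rate = bitrate_kbps(ASSUMED_TOKEN_RATE_HZ, bits)
        size = binary_codebook_size(bits)
        print(f"{bits} bits/token: {rate:.2f} kbps, {size} codes")

    samples = latency_samples(THEORETICAL_LATENCY_MS, ASSUMED_SAMPLE_RATE_HZ)
    frames = latency_frames(THEORETICAL_LATENCY_MS, ASSUMED_TOKEN_RATE_HZ)
    print(f"{THEORETICAL_LATENCY_MS:.0f} ms = {samples} samples = {frames:g} frames")
```

Under these assumptions, an 80 ms latency budget corresponds to a small, fixed number of token frames that the model may wait for before emitting output. Other combinations of token rate and bits per token can yield the same bitrates, so the released code and checkpoints remain the authority on the actual configuration.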