FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation
September 19, 2025
Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli
cs.AI
Abstract
Neural audio codecs are a fundamental component of modern generative audio
pipelines. Although recent codecs achieve strong low-bitrate reconstruction and
provide powerful representations for downstream tasks, most are non-streamable,
limiting their use in real-time applications. We present FocalCodec-Stream, a
hybrid codec based on focal modulation that compresses speech into a single
binary codebook at 0.55 to 0.80 kbps with a theoretical latency of 80 ms. Our
approach combines multi-stage causal distillation of WavLM with targeted
architectural improvements, including a lightweight refiner module that
enhances quality under latency constraints. Experiments show that
FocalCodec-Stream outperforms existing streamable codecs at comparable
bitrates, while preserving both semantic and acoustic information. The result
is a favorable trade-off between reconstruction quality, downstream task
performance, latency, and efficiency. Code and checkpoints will be released at
https://github.com/lucadellalib/focalcodec.
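
The bitrate range quoted above can be related to codebook size with a little arithmetic. The sketch below is illustrative only: for a codec with a single binary codebook, each token carries a fixed number of bits, so the bitrate is the token rate times the bits per token. The token rate of 50 tokens per second and the per-token bit widths (11, 13, 16) are assumptions chosen because they reproduce the 0.55 and 0.80 kbps endpoints; the abstract does not state them, and the helper name `bitrate_kbps` is hypothetical.

```python
def bitrate_kbps(tokens_per_second: float, bits_per_token: int) -> float:
    """Bitrate in kbps of a codec that emits one token per frame
    from a single codebook with 2**bits_per_token entries."""
    return tokens_per_second * bits_per_token / 1000.0


# ASSUMPTION: 50 tokens per second is an illustrative value,
# not a figure taken from the abstract.
ASSUMED_TOKEN_RATE_HZ = 50.0

# ASSUMPTION: these bit widths are chosen only so that the endpoints
# match the 0.55 and 0.80 kbps range quoted in the abstract.
ASSUMED_BITS_PER_TOKEN = (11, 13, 16)

if __name__ == "__main__":
    for bits in ASSUMED_BITS_PER_TOKEN:
        codebook_size = 2 ** bits  # number of distinct binary codes
        rate = bitrate_kbps(ASSUMED_TOKEN_RATE_HZ, bits)
        print(f"{bits} bits/token, {codebook_size} codes: {rate:.2f} kbps")
```

Under these assumptions, widening each binary code by one bit doubles the codebook size but adds only 0.05 kbps, which is one way a single-codebook design can span a bitrate range without changing the token rate.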