FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation

Abstract

Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reconstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55 to 0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates, while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency. Code and checkpoints will be released at https://github.com/lucadellalib/focalcodec.
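To make the quoted bitrates concrete, the following is a minimal sketch of the arithmetic for a single-codebook codec, where the bitrate is the token rate multiplied by the bits carried by each token. The token rate of 50 tokens per second is an assumption made for illustration only; the abstract does not state it. The function names are likewise hypothetical and are not part of the FocalCodec code base.

```python
def bitrate_kbps(tokens_per_second: float, bits_per_token: int) -> float:
    """Bitrate of a single-codebook codec that emits one code per frame."""
    return tokens_per_second * bits_per_token / 1000.0


def codebook_size(bits_per_token: int) -> int:
    """Number of distinct codes when each token is a binary code of this width."""
    return 2 ** bits_per_token


# ASSUMPTION: 50 tokens per second. This value is not given in the abstract.
ASSUMED_TOKEN_RATE_HZ = 50.0

# Under that assumption, 11-bit and 16-bit codes bracket the quoted range.
for bits in (11, 16):
    rate = bitrate_kbps(ASSUMED_TOKEN_RATE_HZ, bits)
    print(f"{bits} bits/token, {codebook_size(bits)} codes: {rate:.2f} kbps")
```

Under this assumed token rate, the 0.55 and 0.80 kbps endpoints would correspond to 11 and 16 bits per token. A different token rate would imply different code widths for the same bitrates.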
