FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation

Abstract

Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reconstruction and provide powerful representations for downstream tasks, most are non-streamable, which limits their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55 to 0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency. Code and checkpoints will be released at https://github.com/lucadellalib/focalcodec.
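As a rough illustration of how a single binary codebook maps to the quoted bitrate range, the sketch below computes bitrate as tokens per second times bits per token. The abstract does not state the token rate or the code width, so the 50 Hz frame rate and the 11-bit and 16-bit code widths used here are assumptions chosen only because they reproduce the 0.55 and 0.80 kbps endpoints. The helper name `bitrate_kbps` is hypothetical.

```python
def bitrate_kbps(frame_rate_hz: float, bits_per_token: int) -> float:
    """Bitrate of a single-codebook codec that emits one token per frame.

    With a binary codebook, each token is a vector of `bits_per_token` bits,
    so the codebook holds 2 ** bits_per_token entries.
    """
    return frame_rate_hz * bits_per_token / 1000.0


# Assumed values, NOT taken from the abstract: 50 tokens per second,
# with 11-bit and 16-bit binary codes.
ASSUMED_FRAME_RATE_HZ = 50.0

for bits in (11, 16):
    codebook_size = 2 ** bits
    rate = bitrate_kbps(ASSUMED_FRAME_RATE_HZ, bits)
    print(f"{bits}-bit codes ({codebook_size} entries): {rate:.2f} kbps")
```

Under these assumptions, a single codebook is enough to cover the stated range because widening the binary code by one bit adds only 50 bits per second.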
