FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation

Abstract

Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reconstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55 to 0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates, while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency. Code and checkpoints will be released at https://github.com/lucadellalib/focalcodec.
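As a rough sanity check on the bitrate range quoted in the abstract, the bitrate of a single-codebook codec is the token rate multiplied by the number of bits per token. The sketch below is illustrative only: the 50 Hz token rate and the 11-bit and 16-bit code widths are assumptions chosen because they reproduce the 0.55 and 0.80 kbps endpoints, and the abstract itself does not state them. The function name `bitrate_kbps` is likewise hypothetical.

```python
import math


def bitrate_kbps(token_rate_hz: float, codebook_size: int) -> float:
    """Bitrate of a single-codebook codec in kbps.

    Each token indexes one of `codebook_size` entries, so it costs
    log2(codebook_size) bits; multiply by tokens per second.
    """
    bits_per_token = math.log2(codebook_size)
    return token_rate_hz * bits_per_token / 1000.0


# ASSUMPTION: token rate is not given in the abstract; 50 Hz is illustrative.
TOKEN_RATE_HZ = 50.0

# ASSUMPTION: binary code widths of 11 and 16 bits are illustrative.
for bits in (11, 16):
    kbps = bitrate_kbps(TOKEN_RATE_HZ, 2 ** bits)
    print(f"{bits}-bit binary code at {TOKEN_RATE_HZ:.0f} Hz: {kbps:.2f} kbps")
```

Under these assumptions, widening the binary code by one bit adds 0.05 kbps, which is why a single binary codebook can span the quoted range without changing the token rate.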
