Re-Bottleneck: Latent Re-Structuring for Neural Audio Autoencoders
July 10, 2025
Authors: Dimitrios Bralios, Jonah Casebeer, Paris Smaragdis
cs.AI
Abstract
Neural audio codecs and autoencoders have emerged as versatile models for
audio compression, transmission, feature extraction, and latent-space
generation. However, a key limitation is that most are trained to maximize
reconstruction fidelity, often neglecting the specific latent structure
necessary for optimal performance in diverse downstream applications. We
propose a simple, post-hoc framework to address this by modifying the
bottleneck of a pre-trained autoencoder. Our method introduces a
"Re-Bottleneck", an inner bottleneck trained exclusively through latent space
losses to instill user-defined structure. We demonstrate the framework's
effectiveness in three experiments. First, we enforce an ordering on latent
channels without sacrificing reconstruction quality. Second, we align latents
with semantic embeddings, analyzing the impact on downstream diffusion
modeling. Third, we introduce equivariance, ensuring that a filtering operation
on the input waveform directly corresponds to a specific transformation in the
latent space. Ultimately, our Re-Bottleneck framework offers a flexible and
efficient way to tailor representations of neural audio models, enabling them
to seamlessly meet the varied demands of different applications with minimal
additional training.
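
To make the idea concrete, here is a minimal PyTorch sketch of a Re-Bottleneck-style inner bottleneck. It is an illustration under stated assumptions, not the paper's implementation: the names (ReBottleneck, nested_dropout, training_step), the (batch, time, channels) latent layout, the use of nested dropout to induce channel ordering, and cosine similarity against a target embedding for semantic alignment are all plausible instantiations we supply for clarity.

```python
# Minimal sketch of a Re-Bottleneck: an inner bottleneck wrapped around the
# latent of a frozen, pre-trained autoencoder, trained only with latent losses.
# All names and shapes here are illustrative assumptions.
from typing import Optional

import torch
import torch.nn as nn


class ReBottleneck(nn.Module):
    """Maps the base latent z (batch, time, dim) to a re-structured latent h,
    and maps h back to z_hat so the frozen base decoder remains usable.
    Only this module receives gradients."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))
        self.dec = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))

    def forward(self, z: torch.Tensor):
        h = self.enc(z)        # re-structured latent
        z_hat = self.dec(h)    # reconstruction of the original latent
        return h, z_hat


def nested_dropout(h: torch.Tensor) -> torch.Tensor:
    """Zero a random suffix of channels -- one way to induce an ordering,
    since early channels must then carry the most information."""
    b, _, c = h.shape
    k = torch.randint(1, c + 1, (b, 1, 1), device=h.device)  # keep first k
    idx = torch.arange(c, device=h.device).view(1, 1, c)
    return h * (idx < k).to(h.dtype)


def training_step(rb: ReBottleneck, z: torch.Tensor,
                  sem: Optional[torch.Tensor] = None) -> torch.Tensor:
    """All losses live in latent space; the base autoencoder stays frozen."""
    h, _ = rb(z)
    # Decode from a randomly truncated h and still demand the original
    # latent back, so that leading channels become the most informative.
    z_trunc = rb.dec(nested_dropout(h))
    loss = torch.mean((z_trunc - z) ** 2)
    if sem is not None:
        # Optional semantic alignment: pull the pooled latent toward a
        # target embedding (e.g., from a pre-trained audio-text model).
        loss = loss + (1 - torch.cosine_similarity(h.mean(dim=1),
                                                   sem, dim=-1)).mean()
    return loss
```

In a setup like this, only the inner bottleneck's parameters are updated; the pre-trained encoder and decoder stay frozen, which is what keeps the restructuring cheap and post-hoc. An equivariance objective would slot in the same way, as an extra latent-space loss tying a transformation of the input's latent to a prescribed transformation of h.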