Re-Bottleneck: Latent Re-Structuring for Neural Audio Autoencoders
July 10, 2025
Authors: Dimitrios Bralios, Jonah Casebeer, Paris Smaragdis
cs.AI
Abstract
Neural audio codecs and autoencoders have emerged as versatile models for
audio compression, transmission, feature extraction, and latent-space
generation. However, a key limitation is that most are trained to maximize
reconstruction fidelity, often neglecting the specific latent structure
necessary for optimal performance in diverse downstream applications. We
propose a simple, post-hoc framework to address this by modifying the
bottleneck of a pre-trained autoencoder. Our method introduces a
"Re-Bottleneck", an inner bottleneck trained exclusively through latent space
losses to instill user-defined structure. We demonstrate the framework's
effectiveness in three experiments. First, we enforce an ordering on latent
channels without sacrificing reconstruction quality. Second, we align latents
with semantic embeddings, analyzing the impact on downstream diffusion
modeling. Third, we introduce equivariance, ensuring that a filtering operation
on the input waveform directly corresponds to a specific transformation in the
latent space. Ultimately, our Re-Bottleneck framework offers a flexible and
efficient way to tailor representations of neural audio models, enabling them
to seamlessly meet the varied demands of different applications with minimal
additional training.
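To make the mechanism concrete, the sketch below shows one plausible reading of the abstract in PyTorch: a small inner encoder/decoder pair (the "Re-Bottleneck") inserted at the latent of a frozen, pre-trained autoencoder and trained exclusively with latent-space losses. The module names, shapes, 1x1-convolution parameterization, and the cosine-alignment structure term are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReBottleneck(nn.Module):
    """Illustrative sketch: an inner bottleneck wrapped around the latent
    of a frozen, pre-trained audio autoencoder. Only this module is
    trained; the outer encoder and decoder stay fixed."""

    def __init__(self, latent_dim: int, inner_dim: int):
        super().__init__()
        # 1x1 convolutions over (batch, channels, time) latents are an
        # assumption chosen for simplicity, not taken from the paper.
        self.inner_enc = nn.Conv1d(latent_dim, inner_dim, kernel_size=1)
        self.inner_dec = nn.Conv1d(inner_dim, latent_dim, kernel_size=1)

    def forward(self, z: torch.Tensor):
        z_inner = self.inner_enc(z)      # re-structured latent
        z_hat = self.inner_dec(z_inner)  # projected back to the original space
        return z_inner, z_hat


def rebottleneck_loss(z, z_hat, z_inner, sem_emb=None, align_weight=1.0):
    """Latent-space-only objective: reconstruct the original latent so the
    frozen decoder remains usable, plus an optional structure term. The
    structure term here is a hypothetical cosine alignment to a semantic
    embedding, standing in for the paper's ordering, alignment, or
    equivariance losses."""
    loss = F.mse_loss(z_hat, z)
    if sem_emb is not None:
        pooled = z_inner.mean(dim=-1)  # time-pooled code, shape (batch, inner_dim)
        loss += align_weight * (1 - F.cosine_similarity(pooled, sem_emb, dim=-1)).mean()
    return loss
```

In use, one would encode audio with the frozen encoder to obtain z, pass it through the Re-Bottleneck, and decode z_hat with the frozen decoder; gradients flow only into the inner encoder/decoder pair, which is what keeps the post-hoc training cost minimal.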