Re-Bottleneck: Latent Re-Structuring for Neural Audio Autoencoders
July 10, 2025
Authors: Dimitrios Bralios, Jonah Casebeer, Paris Smaragdis
cs.AI
Abstract
Neural audio codecs and autoencoders have emerged as versatile models for
audio compression, transmission, feature extraction, and latent-space
generation. However, a key limitation is that most are trained to maximize
reconstruction fidelity, often neglecting the specific latent structure
necessary for optimal performance in diverse downstream applications. We
propose a simple, post-hoc framework to address this by modifying the
bottleneck of a pre-trained autoencoder. Our method introduces a
"Re-Bottleneck", an inner bottleneck trained exclusively through latent space
losses to instill user-defined structure. We demonstrate the framework's
effectiveness in three experiments. First, we enforce an ordering on latent
channels without sacrificing reconstruction quality. Second, we align latents
with semantic embeddings, analyzing the impact on downstream diffusion
modeling. Third, we introduce equivariance, ensuring that a filtering operation
on the input waveform directly corresponds to a specific transformation in the
latent space. Ultimately, our Re-Bottleneck framework offers a flexible and
efficient way to tailor representations of neural audio models, enabling them
to seamlessly meet the varied demands of different applications with minimal
additional training.
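To make the mechanism concrete, the sketch below shows one plausible reading of the abstract in PyTorch: a small inner encoder/decoder pair (the "Re-Bottleneck") inserted at the latent of a frozen, pre-trained autoencoder and trained exclusively with latent-space losses. The module names, shapes, 1x1-convolution parameterization, and the cosine-alignment structure term are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReBottleneck(nn.Module):
    """Illustrative sketch: an inner bottleneck wrapped around the latent
    of a frozen, pre-trained audio autoencoder. Only this module is
    trained; the outer encoder and decoder stay fixed."""

    def __init__(self, latent_dim: int, inner_dim: int):
        super().__init__()
        # 1x1 convolutions over (batch, channels, time) latents are an
        # assumption chosen for simplicity, not taken from the paper.
        self.inner_enc = nn.Conv1d(latent_dim, inner_dim, kernel_size=1)
        self.inner_dec = nn.Conv1d(inner_dim, latent_dim, kernel_size=1)

    def forward(self, z: torch.Tensor):
        z_inner = self.inner_enc(z)      # re-structured latent
        z_hat = self.inner_dec(z_inner)  # projected back to the original space
        return z_inner, z_hat


def rebottleneck_loss(z, z_hat, z_inner, sem_emb=None, align_weight=1.0):
    """Latent-space-only objective: reconstruct the original latent so the
    frozen decoder remains usable, plus an optional structure term. The
    structure term here is a hypothetical cosine alignment to a semantic
    embedding, standing in for the paper's ordering, alignment, or
    equivariance losses."""
    loss = F.mse_loss(z_hat, z)
    if sem_emb is not None:
        pooled = z_inner.mean(dim=-1)  # time-pooled code, shape (batch, inner_dim)
        loss += align_weight * (1 - F.cosine_similarity(pooled, sem_emb, dim=-1)).mean()
    return loss
```

In use, one would encode audio with the frozen encoder to obtain z, pass it through the Re-Bottleneck, and decode z_hat with the frozen decoder; gradients flow only into the inner encoder/decoder pair, which is what keeps the post-hoc training cost minimal.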