TADA! Tuning Audio Diffusion Models through Activation Steering

Abstract

Audio diffusion models can synthesize high-fidelity music from text, yet how they internally represent high-level concepts remains poorly understood. In this work, we use activation patching to show that distinct semantic musical concepts, such as the presence of specific instruments, vocals, or genre characteristics, are controlled by a small, shared subset of attention layers in state-of-the-art audio diffusion architectures. We then show that applying Contrastive Activation Addition and Sparse Autoencoders in these layers enables more precise control over the generated audio, indicating a direct benefit of this specialization. By steering the activations of the identified layers, we can alter specific musical elements with high precision, such as modulating tempo or changing a track's mood.
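To make the steering idea concrete, here is a minimal sketch of Contrastive Activation Addition in its commonly described form: a steering vector is the mean difference between activations collected with and without a concept, and it is added, scaled, to a layer's activation at generation time. The abstract does not specify how the authors apply this to attention layers in audio diffusion models, so everything below is an assumption for illustration: the function names `steering_vector` and `steer`, the scale `alpha`, and the toy 3-dimensional values are hypothetical, and plain Python lists stand in for real activation tensors.

```python
from statistics import fmean

def steering_vector(pos_acts, neg_acts):
    """Mean activation with the concept minus mean activation without it."""
    dim = len(pos_acts[0])
    pos_mean = [fmean(a[i] for a in pos_acts) for i in range(dim)]
    neg_mean = [fmean(a[i] for a in neg_acts) for i in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]

def steer(activation, vector, alpha):
    """Add the scaled steering vector to a single layer activation."""
    return [a + alpha * v for a, v in zip(activation, vector)]

# Toy activations (hypothetical values, not taken from any model).
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]  # e.g. prompts that mention a piano
neg = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]  # e.g. the same prompts without it

v = steering_vector(pos, neg)
steered = steer([0.5, 0.5, 0.5], v, alpha=2.0)
```

In this toy case only the first dimension differs between the two prompt sets, so the steering vector moves the activation along that dimension alone. A positive `alpha` pushes toward the concept and a negative one pushes away from it.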
