TADA! Tuning Audio Diffusion Models through Activation Steering
February 12, 2026
Authors: Łukasz Staniszewski, Katarzyna Zaleska, Mateusz Modrzejewski, Kamil Deja
cs.AI
Abstract
Audio diffusion models can synthesize high-fidelity music from text, yet their internal mechanisms for representing high-level concepts remain poorly understood. In this work, we use activation patching to demonstrate that distinct semantic musical concepts, such as the presence of specific instruments, vocals, or genre characteristics, are controlled by a small, shared subset of attention layers in state-of-the-art audio diffusion architectures. Next, we demonstrate that applying Contrastive Activation Addition and Sparse Autoencoders in these layers enables more precise control over the generated audio, indicating a direct benefit of the specialization phenomenon. By steering activations of the identified layers, we can alter specific musical elements with high precision, such as modulating tempo or changing a track's mood.
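To make the steering idea concrete, here is a minimal sketch of Contrastive Activation Addition on toy vectors. It is not the authors' implementation: the function names `steering_vector` and `steer`, the scaling factor `alpha`, and the three-dimensional activations are all hypothetical, and in practice the activations would be read from (and written back to) the identified attention layers of the diffusion model.

```python
from statistics import fmean


def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations between two contrastive prompt sets.

    pos_acts: activations recorded when the concept is present (assumed input).
    neg_acts: activations recorded when the concept is absent (assumed input).
    Each is a list of equal-length lists of floats.
    """
    dim = len(pos_acts[0])
    pos_mean = [fmean(a[i] for a in pos_acts) for i in range(dim)]
    neg_mean = [fmean(a[i] for a in neg_acts) for i in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(activation, vector, alpha=1.0):
    """Shift one activation along the steering vector, scaled by alpha."""
    return [a + alpha * v for a, v in zip(activation, vector)]


# Toy data: the "concept" only changes the first coordinate.
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

v = steering_vector(pos, neg)
steered = steer([0.5, 0.5, 0.5], v, alpha=0.5)
```

A positive `alpha` pushes the activation toward the concept and a negative one pushes it away; how large `alpha` can get before audio quality degrades is an empirical question this sketch does not address.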