Interpreting and Steering Text-to-Speech Language Models with Sparse Autoencoders

Abstract

Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders (SAEs) on the language-model backbone of CosyVoice3 and introduce a modality-aware automatic interpretation (auto-interp) pipeline that labels each feature according to where it fires: on text-prefix context, on 1-second speech clips, or on both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts, and speaker gender. Steering through the SAE latent space shows that these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as objects of interpretability research and as control directions for TTS synthesis.
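As a rough illustration of the two mechanisms the abstract names, the sketch below shows a BatchTopK activation (keep the k × batch-size largest activations across the whole batch, rather than the top k per sample) and one common way to steer through an SAE latent: overwrite a single feature and decode, adding back the reconstruction error so that everything the SAE does not capture is left unchanged. This is a toy NumPy version under stated assumptions. The class `ToySAE`, the function names, the random weights, and the error-preserving steering rule are hypothetical and are not taken from the paper or from CosyVoice3.

```python
import numpy as np


def batch_topk(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the k * batch_size largest ReLU activations across the batch; zero the rest."""
    acts = np.maximum(pre_acts, 0.0)
    n_keep = k * acts.shape[0]
    if n_keep <= 0:
        return np.zeros_like(acts)
    flat = acts.ravel()
    if n_keep >= flat.size:
        return acts
    # Indices of the n_keep largest entries over the flattened batch.
    keep = np.argpartition(flat, -n_keep)[-n_keep:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(acts.shape)


class ToySAE:
    """Hypothetical SAE with random weights, for illustration only (not trained)."""

    def __init__(self, d_model: int, n_features: int, k: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
        self.b_enc = np.zeros(n_features)
        self.W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
        self.b_dec = np.zeros(d_model)
        self.k = k

    def encode(self, x: np.ndarray) -> np.ndarray:
        return batch_topk((x - self.b_dec) @ self.W_enc + self.b_enc, self.k)

    def decode(self, z: np.ndarray) -> np.ndarray:
        return z @ self.W_dec + self.b_dec


def steer(sae: ToySAE, x: np.ndarray, feature: int, value: float) -> np.ndarray:
    """Set one latent to `value` and decode, preserving the SAE reconstruction error.

    Assumption: the paper's exact intervention may differ from this rule.
    """
    z = sae.encode(x)
    err = x - sae.decode(z)        # part of x the SAE fails to reconstruct
    z_steered = z.copy()
    z_steered[:, feature] = value  # clamp the chosen feature for every position
    return sae.decode(z_steered) + err
```

With this rule, the steered activation equals the original plus `(value - z[:, feature])` times the feature's decoder row, so only the chosen direction in the residual stream changes. In a real setting the steered activations would be written back into the model's forward pass, which this sketch does not attempt.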