Interpreting and Steering Text-to-Speech Language Models with Sparse Autoencoders

Abstract

Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders (SAEs) on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature according to where it fires: on text-prefix context, on 1-second speech clips, or on both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts, and speaker gender. Steering through the SAE latent space shows that these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.
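
To make the two mechanisms named in the abstract concrete, here is a minimal NumPy sketch of a BatchTopK activation and of one common way to steer through an SAE latent. It is an illustration under stated assumptions, not the authors' implementation: the weight names (`W_enc`, `b_enc`, `W_dec`, `b_dec`), the subtraction of the decoder bias before encoding, the pass-through of the reconstruction error, and the clamping of a latent to a fixed `value` are all hypothetical choices made for this example.

```python
import numpy as np


def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest activations across the whole batch.

    Unlike per-sample TopK, the budget is shared by the batch, so some
    samples may keep more than k active latents and others fewer.
    """
    acts = np.maximum(pre_acts, 0.0)          # ReLU before selection
    n_keep = k * acts.shape[0]
    flat = acts.ravel()
    if n_keep >= flat.size:
        return acts
    # Indices of the n_keep largest entries (unordered).
    keep = np.argpartition(flat, -n_keep)[-n_keep:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(acts.shape)


def steer(x, W_enc, b_enc, W_dec, b_dec, k, feature, value):
    """Clamp one SAE latent to `value` and map back to the residual stream.

    Assumption: the SAE reconstruction error is added back unchanged, so
    only the contribution of the chosen feature differs from the input.
    """
    z = batch_topk((x - b_dec) @ W_enc + b_enc, k)
    err = x - (z @ W_dec + b_dec)             # what the SAE fails to reconstruct
    z_steered = z.copy()
    z_steered[:, feature] = value
    return z_steered @ W_dec + b_dec + err
```

Under these assumptions, steering shifts each residual vector along the chosen feature's decoder direction `W_dec[feature]`, scaled by the gap between the clamped value and the feature's original activation; everything else in the residual stream is left untouched.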