Interpreting and Steering Text-to-Speech Language Models with Sparse Autoencoders

Abstract

Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet little is understood about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders (SAEs) on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature according to where it fires: on text-prefix context, on 1-second speech clips, or on both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts, and speaker gender. Steering through the SAE latent space shows that these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as objects of interpretation and as control directions for TTS synthesis.
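
As a rough illustration of the two mechanisms named above, the following NumPy sketch shows a BatchTopK activation (keep the k × batch-size largest activations across the whole batch rather than k per sample) and one common way to steer through an SAE latent (overwrite one feature, decode, and add back the reconstruction error). This is a minimal sketch under stated assumptions, not the paper's implementation: the class `BatchTopKSAE`, the function `steer`, the random weights, and the choice to preserve the reconstruction error are all hypothetical, and no training loop or CosyVoice3 model is involved.

```python
import numpy as np


def batch_topk(acts, k):
    """Keep the k * batch_size largest activations across the whole batch."""
    n_keep = k * acts.shape[0]
    flat = acts.ravel()
    if n_keep <= 0:
        return np.zeros_like(acts)
    if n_keep >= flat.size:
        return acts.copy()
    # Indices of the n_keep largest entries; their relative order is irrelevant.
    keep = np.argpartition(flat, -n_keep)[-n_keep:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(acts.shape)


class BatchTopKSAE:
    """Toy SAE with random (untrained) weights; illustrative only."""

    def __init__(self, d_model, n_features, k, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(0.0, 0.1, (d_model, n_features))
        self.b_enc = np.zeros(n_features)
        self.W_dec = rng.normal(0.0, 0.1, (n_features, d_model))
        self.b_dec = np.zeros(d_model)
        self.k = k

    def encode(self, x):
        # ReLU pre-activations, then batch-level top-k sparsification.
        pre = np.maximum((x - self.b_dec) @ self.W_enc + self.b_enc, 0.0)
        return batch_topk(pre, self.k)

    def decode(self, z):
        return z @ self.W_dec + self.b_dec


def steer(sae, x, feature, value):
    """Set one latent to `value` and map back to the residual stream.

    Assumption: the part of x that the SAE fails to reconstruct is
    added back unchanged, so only the chosen feature's contribution moves.
    """
    z = sae.encode(x)
    error = x - sae.decode(z)
    z_steered = z.copy()
    z_steered[:, feature] = value
    return sae.decode(z_steered) + error


# Usage: four residual-stream vectors of width 8, 32 latent features, k = 2.
sae = BatchTopKSAE(d_model=8, n_features=32, k=2)
x = np.random.default_rng(1).normal(size=(4, 8))
z = sae.encode(x)
x_steered = steer(sae, x, feature=3, value=5.0)
```

Because the error term is added back, the steered vector differs from the original only along the decoder row of the chosen feature, which is what makes the intervention targeted in this sketch.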