Interpreting and Steering Text-to-Speech Language Models with Sparse Autoencoders

Abstract

Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet little is known about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders (SAEs) on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interpretation pipeline that labels each feature according to where it fires: in text-prefix context, in 1-second speech clips, or in both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts and speaker gender. Steering through the SAE latent space shows that these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as objects of interpretability and as control directions for TTS synthesis.
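
As a rough illustration of the two operations named above, the following NumPy sketch shows a BatchTopK activation (keep the `k * batch_size` largest activations across the whole batch rather than `k` per sample) and one simple way to steer through the latent space (clamp a single feature, decode, and add back the SAE's reconstruction error). Everything here is an assumption made for illustration: the function names, the weight shapes, the random weights, and the clamp-and-decode steering rule are hypothetical and are not taken from the paper's implementation or from CosyVoice3.

```python
import numpy as np


def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest ReLU activations across the whole batch."""
    acts = np.maximum(pre_acts, 0.0)              # ReLU before selection
    flat = acts.ravel()
    n_keep = min(k * acts.shape[0], flat.size)    # budget shared by the batch
    keep = np.argpartition(flat, -n_keep)[-n_keep:]
    sparse = np.zeros_like(flat)
    sparse[keep] = flat[keep]                     # everything else stays zero
    return sparse.reshape(acts.shape)


def encode(x, W_enc, b_enc, b_dec, k):
    """Map residual-stream vectors x (batch, d_model) to sparse latents."""
    return batch_topk((x - b_dec) @ W_enc + b_enc, k)


def steer(x, W_enc, b_enc, W_dec, b_dec, feature, value, k):
    """Clamp one latent to `value` and decode (hypothetical steering rule)."""
    z = encode(x, W_enc, b_enc, b_dec, k)
    error = x - (z @ W_dec + b_dec)               # part the SAE fails to reconstruct
    z_steered = z.copy()
    z_steered[:, feature] = value
    return z_steered @ W_dec + b_dec + error      # only the chosen direction changes


# Toy usage with random weights (illustrative only, not a trained SAE).
rng = np.random.default_rng(0)
d_model, n_features, batch, k = 8, 32, 4, 3
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))
b_enc, b_dec = np.zeros(n_features), np.zeros(d_model)
x = rng.normal(size=(batch, d_model))
steered = steer(x, W_enc, b_enc, W_dec, b_dec, feature=5, value=10.0, k=k)
```

Because the reconstruction error is added back, this particular rule changes the residual vector only along the decoder row of the chosen feature, which is one plausible reading of how an intervention could alter an attribute such as laughter while leaving other content untouched.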