Accent Vector: Controllable Accent Manipulation for Multilingual TTS Without Accented Data
March 8, 2026
Authors: Thanathai Lertpetchpun, Thanapat Trachu, Jihwan Lee, Tiantian Feng, Dani Byrd, Shrikanth Narayanan
cs.AI
Abstract
Accent is an integral part of society, reflecting multiculturalism and shaping how individuals express identity. The majority of English speakers are non-native (L2) speakers, yet current text-to-speech (TTS) systems primarily model American-accented English due to limited accented data. We propose Accent Vector, a controllable representation that enables accent manipulation in multilingual TTS without requiring accented training data. Accent Vector is derived by fine-tuning a TTS system on native speech of a different language (i.e., non-English) and computing task vectors that capture accent characteristics (i.e., in English). By scaling and interpolating the vector, we achieve fine-grained control over accent strength and generate mixed-accent speech. In addition, the method generalizes beyond English, enabling accent control across multiple languages. Objective and human evaluations confirm the effectiveness of Accent Vector for fine-grained and compositional accent control.
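The abstract does not spell out the arithmetic, so the following is only a minimal sketch of how scaling and interpolating a task vector could work. It assumes the usual task-arithmetic definition, in which a task vector is the element-wise difference between fine-tuned and pretrained weights. The names `task_vector` and `apply_vectors`, the `Weights` type, and the toy parameter values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flattened values.
# (Assumption: a real TTS checkpoint would hold tensors, not Python lists.)
Weights = Dict[str, List[float]]


def task_vector(pretrained: Weights, finetuned: Weights) -> Weights:
    """Return tau = theta_finetuned - theta_pretrained for every parameter."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def apply_vectors(
    pretrained: Weights, vectors: List[Weights], scales: List[float]
) -> Weights:
    """Return theta = theta_pretrained + sum_i scales[i] * vectors[i]."""
    merged = {name: list(values) for name, values in pretrained.items()}
    for tau, scale in zip(vectors, scales):
        for name in merged:
            merged[name] = [w + scale * d for w, d in zip(merged[name], tau[name])]
    return merged


# Hypothetical weights: one base model and two models fine-tuned on
# native speech of two different non-English languages, A and B.
base = {"layer.weight": [1.0, 2.0]}
finetuned_a = {"layer.weight": [1.5, 2.0]}
finetuned_b = {"layer.weight": [1.0, 3.0]}

tau_a = task_vector(base, finetuned_a)
tau_b = task_vector(base, finetuned_b)

# Scaling one vector adjusts how strongly accent A is applied.
half_strength_a = apply_vectors(base, [tau_a], [0.5])

# Weighting two vectors interpolates between accents A and B.
mixed_ab = apply_vectors(base, [tau_a, tau_b], [0.5, 0.5])
```

Under these assumptions, a scale of 0 leaves the base model unchanged and a scale of 1 reproduces the fine-tuned weights, so intermediate scales correspond to intermediate accent strength.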