Accent Vector: Controllable Accent Manipulation for Multilingual TTS Without Accented Data

Abstract

Accent is an integral part of society, reflecting multiculturalism and shaping how individuals express identity. The majority of English speakers are non-native (L2) speakers, yet current text-to-speech (TTS) systems primarily model American-accented English because accented data is limited. We propose Accent Vector, a controllable representation that enables accent manipulation in multilingual TTS without requiring accented training data. Accent Vector is derived by fine-tuning a TTS system on native speech of a different language (i.e., a non-English language) and computing task vectors that capture the accent characteristics this language imparts to English. By scaling and interpolating these vectors, we achieve fine-grained control over accent strength and generate mixed-accent speech. The approach also generalizes beyond English, enabling accent control across multiple languages. Objective and human evaluations confirm the effectiveness of Accent Vector for fine-grained and compositional accent control.
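The abstract does not spell out how the task vectors are computed or combined. As a minimal sketch, assuming the standard task-arithmetic convention (a task vector is the fine-tuned weights minus the base weights, and edited weights are the base plus a weighted sum of task vectors), scaling and interpolation might look as follows. The names `task_vector` and `apply_vectors`, the parameter key `layer.w`, and all numeric values are hypothetical and chosen only for illustration; plain Python lists stand in for real model tensors.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def task_vector(base: Params, finetuned: Params) -> Params:
    """Element-wise difference (finetuned - base) for every parameter."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def apply_vectors(base: Params, vectors: List[Params], weights: List[float]) -> Params:
    """Return base parameters plus a weighted sum of task vectors."""
    out = {k: list(v) for k, v in base.items()}  # copy so base stays unchanged
    for vec, w in zip(vectors, weights):
        for k in out:
            out[k] = [p + w * d for p, d in zip(out[k], vec[k])]
    return out


# Hypothetical toy weights (assumption: not taken from the paper).
base = {"layer.w": [1.0, 2.0]}
ft_lang_a = {"layer.w": [2.0, 2.0]}  # fine-tuned on native speech of language A
ft_lang_b = {"layer.w": [1.0, 4.0]}  # fine-tuned on native speech of language B

vec_a = task_vector(base, ft_lang_a)
vec_b = task_vector(base, ft_lang_b)

# Scaling: a weight of 0.5 applies half of accent A's parameter shift.
half_strength = apply_vectors(base, [vec_a], [0.5])

# Interpolation: combine two accent vectors to target a mixed accent.
mixed = apply_vectors(base, [vec_a, vec_b], [0.5, 0.5])
```

Under these assumptions, a weight of 0 recovers the base model and a weight of 1 recovers the fine-tuned model, so intermediate weights would correspond to intermediate accent strength.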
