Accent Vector: Controllable Accent Manipulation for Multilingual TTS Without Accented Data

Abstract

Accent is an integral part of society, reflecting multiculturalism and shaping how individuals express identity. The majority of English speakers are non-native (L2) speakers, yet current Text-To-Speech (TTS) systems primarily model American-accented English due to limited accented data. We propose Accent Vector, a controllable representation that enables accent manipulation in multilingual TTS without requiring accented training data. Accent Vector is derived by fine-tuning a TTS system on native speech of a different language (i.e., non-English) and computing task vectors that capture accent characteristics (i.e., in English). By scaling and interpolating the vector, we achieve fine-grained control over accent strength and generate mixed-accent speech. In addition, the method generalizes beyond English, enabling accent control across multiple languages. Objective and human evaluations confirm the effectiveness of Accent Vector for fine-grained and compositional accent control.
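
The abstract does not spell out how the task vectors are computed. As a rough illustration only, the sketch below assumes the common task-arithmetic convention in which a task vector is the per-parameter difference between fine-tuned and base weights, and in which scaling and summing such vectors gives strength control and mixing. The toy weights, the parameter name "layer.weight", and the helper names task_vector and apply_vectors are all hypothetical and are not taken from the paper.

```python
# Toy illustration of task-vector arithmetic (assumed convention, not the paper's code).
# Weights are plain dicts of float lists; a real TTS checkpoint would hold tensors.

def task_vector(base, finetuned):
    """Per-parameter difference: finetuned - base."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}

def apply_vectors(base, vectors_and_scales):
    """Return a copy of the base weights with each scaled task vector added."""
    out = {k: list(v) for k, v in base.items()}
    for vec, alpha in vectors_and_scales:
        for k in out:
            out[k] = [w + alpha * d for w, d in zip(out[k], vec[k])]
    return out

# Hypothetical two-parameter "model".
base = {"layer.weight": [1.0, 2.0]}
ft_lang_a = {"layer.weight": [1.5, 2.0]}  # fine-tuned on native speech of language A
ft_lang_b = {"layer.weight": [1.0, 3.0]}  # fine-tuned on native speech of language B

vec_a = task_vector(base, ft_lang_a)
vec_b = task_vector(base, ft_lang_b)

# Scaling one vector: a weaker version of accent A (scale 0.5 is an arbitrary choice).
half_strength_a = apply_vectors(base, [(vec_a, 0.5)])

# Combining two vectors: a mix of accents A and B.
mixed = apply_vectors(base, [(vec_a, 0.5), (vec_b, 0.5)])
```

Under this assumption, a scale of 0 leaves the base model unchanged and a scale of 1 reproduces the fine-tuned weights, so intermediate values would correspond to intermediate accent strength.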