Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

Abstract

Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non-Indic-native base to commercial-class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministically romanises seven Indic scripts to ISO-15919 so that Chatterbox's Latin tokeniser can process them; (2) a LoRA adapter on only the text-token predictor (Chatterbox's t3), trained on ~1,220h of licensed Indic audio with a Hindi-proxy language_id; (3) a voice-prompt recovery recipe, consisting of an 8-11s same-language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1; "Config B"), that recovers commercial-class acoustic output with no acoustic-decoder training. On Hindi, the LoRA regresses accuracy, so we instead use vanilla Chatterbox + Config B, giving a two-branch deployment. Evaluated on 10-utterance pilot sets with the companion PSP benchmark, Praxy Voice matches or slightly leads commercial baselines: 26.7% retroflex collapse on Telugu (vs 33.3% for Sarvam Bulbul), 71% Tamil-zha collapse (vs 86% for the commercial trio), and 0.025 LLM-WER on Hindi (tied with Cartesia Sonic-3). For intra-sentential code-mix we add a third branch (IndicF5 + native-script transliteration) that reduces code-mix LLM-WER from 0.80-0.85 to 0.14-0.27 across Hi/Te/Ta. We release R6 LoRA weights (Apache-2.0), inference code and router (MIT), and a Gradio demo.
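
The three-branch deployment described above can be pictured as a small routing function. The Python sketch below is illustrative only and is not the released router: the names `Route`, `choose_route`, and `has_latin_letters` are hypothetical, and treating any Latin letter in the input as a sign of code-mix is an assumption, as are the decisions to apply BUPS romanisation only on the LoRA branch and to leave sampling settings unspecified on the IndicF5 branch. Only the Config B values and the branch assignments per language come from the abstract.

```python
from dataclasses import dataclass
from typing import Optional

# Sampling overrides quoted in the abstract as "Config B".
CONFIG_B = {"exaggeration": 0.7, "temperature": 0.6, "min_p": 0.1}

SUPPORTED_LANGS = {"hi", "te", "ta"}


@dataclass(frozen=True)
class Route:
    """Hypothetical description of which branch handles an utterance."""
    branch: str
    romanise_with_bups: bool   # assumption: only the LoRA branch needs BUPS
    sampling: Optional[dict]   # None means "not specified by the abstract"


def has_latin_letters(text: str) -> bool:
    # Assumed heuristic: any ASCII letter marks the sentence as code-mixed.
    return any(ch.isascii() and ch.isalpha() for ch in text)


def choose_route(text: str, lang: str) -> Route:
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language: {lang!r}")
    if has_latin_letters(text):
        # Third branch: IndicF5 with native-script transliteration.
        return Route("indicf5_native_script", False, None)
    if lang == "hi":
        # The LoRA regresses Hindi accuracy, so Hindi uses the vanilla base.
        return Route("chatterbox_vanilla", False, dict(CONFIG_B))
    # Telugu and Tamil: BUPS romanisation, then the LoRA-adapted t3.
    return Route("chatterbox_r6_lora", True, dict(CONFIG_B))
```

For example, `choose_route("నమస్కారం", "te")` selects the LoRA branch with Config B, while a Hindi sentence containing the English word "meeting" is sent to the IndicF5 branch.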
