VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

Abstract

This paper introduces VALL-E 2, the latest advancement in neural codec language models, which marks a milestone in zero-shot text-to-speech synthesis (TTS) by achieving human parity for the first time. Building on its predecessor, VALL-E, the new iteration introduces two significant enhancements. Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history; it not only stabilizes decoding but also circumvents the infinite loop issue. Grouped Code Modeling organizes codec codes into groups to effectively shorten the sequence length, which not only boosts inference speed but also addresses the challenges of long sequence modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity, and that it is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases. The advantages of this work could contribute to valuable endeavors, such as generating speech for individuals with aphasia or people with amyotrophic lateral sclerosis. Demos of VALL-E 2 will be posted to https://aka.ms/valle2.
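
The abstract names the two enhancements without giving their details, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that Repetition Aware Sampling falls back from nucleus sampling to sampling from the full distribution when the chosen token dominates a recent window of the decoding history, and that Grouped Code Modeling partitions the code sequence into fixed-size groups. The function names and the parameters `top_p`, `window`, `max_ratio`, `group_size`, and `pad_id` are all hypothetical.

```python
import random
from typing import List, Optional, Sequence


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample from the smallest set of top tokens whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    mass = 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(
    probs: Sequence[float],
    history: Sequence[int],
    top_p: float = 0.8,       # assumed value, not taken from the abstract
    window: int = 10,         # assumed history window (must be > 0)
    max_ratio: float = 0.1,   # assumed repetition threshold
    rng: Optional[random.Random] = None,
) -> int:
    """Nucleus sampling with a fallback when the sampled token repeats too often."""
    rng = rng or random.Random()
    token = nucleus_sample(probs, top_p, rng)
    recent = list(history[-window:])
    if recent and recent.count(token) / len(recent) > max_ratio:
        # Assumed fallback: draw from the full, untruncated distribution so
        # that decoding has a chance to leave a repetition loop.
        token = rng.choices(range(len(probs)), weights=list(probs), k=1)[0]
    return token


def group_codes(codes: Sequence[int], group_size: int, pad_id: int = 0) -> List[List[int]]:
    """Partition a code sequence into groups, padding the final group if needed."""
    padded = list(codes)
    remainder = len(padded) % group_size
    if remainder:
        padded.extend([pad_id] * (group_size - remainder))
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]
```

With a group size of 2, for example, a sequence of 5 codes becomes 3 groups, so a model that predicts one group per step would run roughly half as many autoregressive steps. The padding choice is likewise an assumption made for illustration.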
