Comprehensive Benchmark Evaluation of Long-Form Speech Generation Across Diverse Scenarios

Abstract

Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains largely unexplored. A comprehensive evaluation benchmark for long-form speech is indispensable for two reasons: 1) existing test scenarios are often confined to limited domains, leaving a significant gap relative to diverse downstream applications; 2) existing metrics overlook critical long-text factors such as consistency and coherence, and therefore fail to generalize reliably. To this end, we propose SwanBench-Speech, a comprehensive benchmark that decomposes long-form speech quality into specific, disentangled dimensions. SwanBench-Speech has three key properties: 1) Rich speech scenarios: focusing on long-form speech generation and dialogue generation, SwanBench-Speech covers challenges in acoustics, semantics, and expressiveness, and consists of 1,101 samples spanning 17 common speech scenarios; 2) Comprehensive evaluation dimensions: along the acoustics, semantics, and expressiveness axes, SwanBench-Speech defines an automated evaluation protocol with seven metrics to provide a comprehensive, accurate, and standardized assessment; 3) Valuable insights: through extensive experiments, we find that current models still struggle in highly expressive scenarios and exhibit a notable gap in consistency and hierarchy compared to real recordings.