A Comprehensive Benchmark for Long-Form Speech Generation in Diverse Scenarios

Abstract

Recent advances in speech generation have enabled high-fidelity synthesis, yet the systematic evaluation of models under long-context conditions remains largely unexplored. A comprehensive benchmark for long-form speech is needed for two reasons: 1) existing test scenarios are often confined to a few domains, leaving a significant gap with diverse downstream applications; 2) existing metrics overlook factors that are critical in long-form content, such as consistency and coherence, and therefore do not generalize reliably. To this end, we propose SwanBench-Speech, a comprehensive benchmark that decomposes long-form speech quality into specific, disentangled dimensions. SwanBench-Speech has three key properties. 1) Rich speech scenarios: focusing on long-form speech generation and dialogue generation, it covers challenges in acoustics, semantics, and expressiveness, and comprises 1,101 samples spanning 17 common speech scenarios. 2) Comprehensive evaluation dimensions: along the acoustic, semantic, and expressiveness axes, it defines an automated evaluation protocol with seven metrics to provide a comprehensive, accurate, and standardized assessment. 3) Valuable insights: through extensive experiments, we show that current models still struggle in highly expressive scenarios and exhibit a notable gap in consistency and hierarchical structure compared with real recordings.