Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

May 27, 2026
Authors: Changhao Pan, Rui Yang, Han Wang, Zhuan Zhou, Xuming He, Wenxiang Guo, Ziyue Jiang, Ruiqi Li, Yu Zhang, Chenyuhao Wen, Ke Lei, Xiang Yin, Jingyu Lu, Zhiyuan Zhu, Zhou Zhao
cs.AI

Abstract

Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains largely unexplored. A comprehensive benchmark for long-form speech is needed for two reasons: 1) existing test scenarios are often confined to a few domains, leaving a significant gap between them and diverse downstream applications; 2) existing metrics overlook factors critical to long-form speech, such as consistency and coherence, and therefore fail to generalize reliably. To this end, we propose SwanBench-Speech, a comprehensive benchmark that decomposes long-form speech quality into specific, disentangled dimensions. SwanBench-Speech has three key properties. 1) Rich speech scenarios: focusing on long-form speech generation and dialogue generation, SwanBench-Speech covers challenges in acoustics, semantics, and expressiveness, and consists of 1,101 samples spanning 17 common speech scenarios. 2) Comprehensive evaluation dimensions: along the acoustics, semantics, and expressiveness axes, SwanBench-Speech defines an automated evaluation protocol with seven metrics to provide a comprehensive, accurate, and standardized assessment. 3) Valuable insights: through extensive experiments, we reveal that current models still struggle in highly expressive scenarios and exhibit a notable gap in consistency and hierarchy compared to real recordings.