Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

May 27, 2026
Authors: Changhao Pan, Rui Yang, Han Wang, Zhuan Zhou, Xuming He, Wenxiang Guo, Ziyue Jiang, Ruiqi Li, Yu Zhang, Chenyuhao Wen, Ke Lei, Xiang Yin, Jingyu Lu, Zhiyuan Zhu, Zhou Zhao
cs.AI

Abstract

Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains largely underexplored. A comprehensive evaluation benchmark for long-form speech is needed for two reasons: 1) existing test scenarios are often confined to limited domains, leaving a significant gap with diverse downstream applications; 2) existing metrics overlook critical long-text factors such as consistency and coherence, and therefore fail to generalize reliably. To this end, we propose SwanBench-Speech, a comprehensive benchmark that decomposes long-form speech quality into specific, disentangled dimensions. SwanBench-Speech has three key properties. 1) Rich speech scenarios: focusing on long-form speech generation and dialog generation, SwanBench-Speech covers acoustic, semantic, and expressiveness challenges, and consists of 1,101 samples spanning 17 common speech scenarios. 2) Comprehensive evaluation dimensions: along the acoustic, semantic, and expressiveness axes, SwanBench-Speech defines an automated evaluation protocol with seven metrics to provide a comprehensive, accurate, and standardized assessment. 3) Valuable insights: through extensive experiments, we find that current models still struggle in highly expressive scenarios and exhibit a notable gap in consistency and hierarchy compared to real recordings.