A Comprehensive Benchmark for Evaluating Long-Form Speech Generation Across Multiple Scenarios

Abstract

Recent advances in speech generation have enabled high-fidelity synthesis, yet the systematic evaluation of models under long-context conditions remains underexplored. A comprehensive benchmark for long-form speech is needed for two reasons: 1) existing test scenarios are often confined to a few domains, leaving a significant gap with diverse downstream applications; 2) existing metrics overlook factors that are critical in long-form speech, such as consistency and coherence, and therefore fail to generalize reliably. To this end, we propose SwanBench-Speech, a comprehensive benchmark that decomposes long-form speech quality into specific, disentangled dimensions. SwanBench-Speech has three key properties: 1) Rich speech scenarios: focusing on long-form speech generation and dialogue generation, SwanBench-Speech covers acoustic, semantic, and expressiveness challenges and consists of 1,101 samples spanning 17 common speech scenarios; 2) Comprehensive evaluation dimensions: along the acoustic, semantic, and expressiveness axes, SwanBench-Speech defines an automated evaluation protocol with seven metrics that provides a comprehensive, accurate, and standardized assessment; 3) Valuable insights: through extensive experiments, we find that current models still struggle in highly expressive scenarios and show a notable gap in consistency and hierarchy compared with real recordings.