Comprehensive Benchmarking of Long-Form Speech Generation Across Diverse Scenarios

Abstract

Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains largely unexplored. A comprehensive evaluation benchmark for long-form speech is needed for two reasons: 1) existing test scenarios are confined to limited domains, leaving a significant gap between them and the diverse downstream applications; 2) existing metrics overlook critical long-form factors such as consistency and coherence, and therefore fail to generalize reliably. To this end, we propose SwanBench-Speech, a comprehensive benchmark that decomposes long-form speech quality into specific, disentangled dimensions. SwanBench-Speech has three key properties. 1) Rich speech scenarios: focusing on long-form speech generation and dialog generation, SwanBench-Speech covers challenges in acoustics, semantics, and expressiveness, and consists of 1,101 samples spanning 17 common speech scenarios. 2) Comprehensive evaluation dimensions: along the acoustics, semantics, and expressiveness axes, SwanBench-Speech defines an automated evaluation protocol with seven metrics to provide a comprehensive, accurate, and standardized assessment. 3) Valuable insights: through extensive experiments, we show that current models still struggle in highly expressive scenarios and exhibit a notable gap in consistency and hierarchy compared with real recordings.