VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions
September 9, 2025
Authors: Jun Zhan, Mingyang Han, Yuxuan Xie, Chen Wang, Dong Zhang, Kexin Huang, Haoxiang Shi, DongXiao Wang, Tengtao Song, Qinyuan Cheng, Shimin Li, Jun Song, Xipeng Qiu, Bo Zheng
cs.AI
Abstract
Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation, enabling natural human-machine interaction. However, while most progress has focused on semantic accuracy and instruction following, the ability of SLMs to adapt their speaking style based on spoken instructions has received limited attention. We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style, such as timbre, prosody, or persona, in response to natural-language spoken commands. To study this task, we present VStyle, a bilingual (Chinese & English) benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy. We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness, ensuring reproducible and objective assessment. Experiments on commercial systems and open-source SLMs demonstrate that current models face clear limitations in controllable style adaptation, highlighting both the novelty and challenge of this task. By releasing VStyle and its evaluation toolkit, we aim to provide the community with a foundation for advancing human-centered spoken interaction. The dataset and code are publicly available at the project homepage: https://junzhan2000.github.io/VStyle.github.io/.
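
The abstract describes a progressive, three-stage LALM-as-a-Judge evaluation (textual faithfulness, then style adherence, then naturalness). The sketch below is only a minimal illustration of such a staged judging flow, not the released toolkit; the names `progressive_judge` and `ask_lalm`, the 0-1 rating prompts, and the pass threshold are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class JudgeResult:
    stage: str    # which evaluation stage produced this score
    score: float  # normalized rating in [0, 1]
    passed: bool  # whether the response proceeds to the next stage


def progressive_judge(
    audio_response: bytes,
    instruction: str,
    ask_lalm: Callable[[str, bytes], float],
    threshold: float = 0.5,
) -> List[JudgeResult]:
    """Score a spoken response stage by stage; later stages run only if earlier ones pass.

    `ask_lalm` is a placeholder for any large audio-language model call that
    returns a scalar rating for the given text prompt and audio clip.
    """
    stages = [
        ("textual_faithfulness",
         f"Does the speech convey a correct answer to: {instruction}? Rate 0-1."),
        ("style_adherence",
         f"Does the speaking style follow the style requested in: {instruction}? Rate 0-1."),
        ("naturalness",
         "Does the speech sound natural and fluent? Rate 0-1."),
    ]
    results: List[JudgeResult] = []
    for name, prompt in stages:
        score = ask_lalm(prompt, audio_response)
        passed = score >= threshold
        results.append(JudgeResult(stage=name, score=score, passed=passed))
        if not passed:  # progressive evaluation: stop once a stage fails
            break
    return results
```

In practice, `ask_lalm` would wrap an audio-language model API call plus a parsing step that extracts a numeric rating from the model's reply.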