

VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions

September 9, 2025
Authors: Jun Zhan, Mingyang Han, Yuxuan Xie, Chen Wang, Dong Zhang, Kexin Huang, Haoxiang Shi, DongXiao Wang, Tengtao Song, Qinyuan Cheng, Shimin Li, Jun Song, Xipeng Qiu, Bo Zheng
cs.AI

Abstract

Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation, enabling natural human-machine interaction. However, while most progress has focused on semantic accuracy and instruction following, the ability of SLMs to adapt their speaking style in response to spoken instructions has received limited attention. We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style, such as timbre, prosody, or persona, following natural-language spoken commands. To study this task, we present VStyle, a bilingual (Chinese and English) benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy. We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness, ensuring reproducible and objective assessment. Experiments on commercial systems and open-source SLMs demonstrate that current models face clear limitations in controllable style adaptation, highlighting both the novelty and the challenge of this task. By releasing VStyle and its evaluation toolkit, we aim to provide the community with a foundation for advancing human-centered spoken interaction. The dataset and code are publicly available on the project homepage: https://junzhan2000.github.io/VStyle.github.io/.
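To make the progressive evaluation described above more concrete, the sketch below shows one way a gated, three-stage LALM-as-a-Judge pass could be organized: a response is scored for textual faithfulness first, and only responses that pass move on to style adherence and then naturalness. This is a minimal illustration under assumed interfaces; the function `progressive_judge`, the `judge_fn` scoring call, the dataclass fields, and the thresholds are hypothetical and are not taken from the released VStyle toolkit.

```python
# Minimal sketch (not the authors' implementation) of a progressive
# "LALM as a Judge" pass: an audio response is scored along three axes
# in order -- textual faithfulness, style adherence, naturalness -- and
# later stages are only reached if earlier ones pass a threshold.
# judge_fn, the dataclass fields, and the thresholds are hypothetical
# placeholders, not part of the released VStyle toolkit.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class JudgeResult:
    faithfulness: Optional[float] = None
    style_adherence: Optional[float] = None
    naturalness: Optional[float] = None

def progressive_judge(
    instruction_audio: bytes,
    response_audio: bytes,
    judge_fn: Callable[[str, bytes, bytes], float],  # assumed LALM call returning a score in [0, 1]
    pass_threshold: float = 0.5,
) -> JudgeResult:
    """Score a spoken response stage by stage, stopping early on failure."""
    result = JudgeResult()

    # Stage 1: does the spoken response say the right thing?
    result.faithfulness = judge_fn("textual_faithfulness", instruction_audio, response_audio)
    if result.faithfulness < pass_threshold:
        return result  # no point judging the style of an off-topic answer

    # Stage 2: does the voice follow the requested style (timbre, prosody, persona)?
    result.style_adherence = judge_fn("style_adherence", instruction_audio, response_audio)
    if result.style_adherence < pass_threshold:
        return result

    # Stage 3: does the styled speech still sound natural?
    result.naturalness = judge_fn("naturalness", instruction_audio, response_audio)
    return result
```

The gating order matters in a design like this: a response that ignores the instruction's content should not be rewarded for sounding stylish, so faithfulness is checked before style or naturalness is ever scored.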