

WavReward: Spoken Dialogue Models With Generalist Reward Evaluators

May 14, 2025
作者: Shengpeng Ji, Tianle Liang, Yangzhuo Li, Jialong Zuo, Minghui Fang, Jinzheng He, Yifu Chen, Zhengqing Liu, Ziyue Jiang, Xize Cheng, Siqi Zheng, Jin Xu, Junyang Lin, Zhou Zhao
cs.AI

Abstract

End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered significant attention in the speech domain. However, the evaluation of spoken dialogue models' conversational performance has largely been overlooked. This is primarily because intelligent chatbots convey a wealth of non-textual information that cannot easily be measured with text-based language models like ChatGPT. To address this gap, we propose WavReward, a reward feedback model based on audio language models that can evaluate both the IQ and EQ of spoken dialogue systems with speech input. Specifically, 1) building on audio language models, WavReward incorporates a deep reasoning process and a nonlinear reward mechanism for post-training. By utilizing multi-sample feedback via a reinforcement learning algorithm, we construct a specialized evaluator tailored to spoken dialogue models. 2) We introduce ChatReward-30K, a preference dataset used to train WavReward. ChatReward-30K covers both the comprehension and generation aspects of spoken dialogue models. Its scenarios span various tasks, such as text-based chats, instruction chats covering nine acoustic attributes, and implicit chats. WavReward outperforms previous state-of-the-art evaluation models across multiple spoken dialogue scenarios, improving objective accuracy over Qwen2.5-Omni from 55.1% to 91.5%. In subjective A/B testing, WavReward also leads by a margin of 83%. Comprehensive ablation studies confirm the necessity of each component of WavReward. All data and code will be made publicly available at https://github.com/jishengpeng/WavReward after the paper is accepted.
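
The abstract describes the post-training recipe only at a high level (an audio-LM judge, a nonlinear reward mechanism, and multi-sample feedback under reinforcement learning). As a rough illustration only, the Python sketch below shows one way multi-sample reward feedback with a nonlinear reward mapping could be organized. Every name here (`DialogueSample`, `nonlinear_reward`, `multi_sample_feedback`, `toy_judge`) and the exponential reward form are hypothetical stand-ins, not the authors' implementation or API.

```python
# Illustrative sketch only: the paper's training code is not published here,
# so all names and the reward shape below are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable, List
import math
import random

@dataclass
class DialogueSample:
    prompt_audio: str      # path to the user's spoken prompt
    response_audio: str    # path to a candidate model response
    human_score: float     # preference label from a ChatReward-30K-style dataset, in [1, 5]

def nonlinear_reward(raw_score: float, target: float, sharpness: float = 2.0) -> float:
    """Map the judge's raw score to a reward that decays nonlinearly with
    distance from the human label (one possible 'nonlinear reward mechanism';
    the abstract does not specify the exact form)."""
    return math.exp(-sharpness * abs(raw_score - target))

def multi_sample_feedback(
    judge: Callable[[str, str], float],   # audio-LM evaluator: (prompt, response) -> score
    samples: List[DialogueSample],
    k: int = 4,
) -> float:
    """Average reward over k sampled judge evaluations per dialogue,
    mimicking the multi-sample feedback described in the abstract."""
    total = 0.0
    for s in samples:
        scores = [judge(s.prompt_audio, s.response_audio) for _ in range(k)]
        rewards = [nonlinear_reward(sc, s.human_score) for sc in scores]
        total += sum(rewards) / k
    return total / len(samples)

if __name__ == "__main__":
    # A noisy stand-in judge, only for demonstration.
    def toy_judge(prompt: str, response: str) -> float:
        return 3.0 + random.uniform(-1.0, 1.0)

    data = [DialogueSample("q1.wav", "a1.wav", human_score=4.0)]
    print(f"mean reward: {multi_sample_feedback(toy_judge, data):.3f}")
```

In a real setup, the averaged reward from repeated judge samples would feed a policy-gradient update of the evaluator; the toy judge here simply stands in for the audio language model scoring speech input.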

