WavReward: Spoken Dialogue Models With Generalist Reward Evaluators
May 14, 2025
Authors: Shengpeng Ji, Tianle Liang, Yangzhuo Li, Jialong Zuo, Minghui Fang, Jinzheng He, Yifu Chen, Zhengqing Liu, Ziyue Jiang, Xize Cheng, Siqi Zheng, Jin Xu, Junyang Lin, Zhou Zhao
cs.AI
Abstract
End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered
significant attention in the speech domain. However, the evaluation of spoken
dialogue models' conversational performance has largely been overlooked. This
is primarily because intelligent chatbots convey a wealth of non-textual
information that cannot be easily measured using text-based language models
like ChatGPT. To address this gap, we propose WavReward, a reward feedback
model based on audio language models that can evaluate both the IQ and EQ of
spoken dialogue systems with speech input. Specifically, 1) built on audio
language models, WavReward incorporates a deep reasoning process and a
nonlinear reward mechanism for post-training. By utilizing multi-sample
feedback via a reinforcement learning algorithm, we construct a specialized
evaluator tailored to spoken dialogue models. 2) We introduce ChatReward-30K, a
preference dataset used to train WavReward. ChatReward-30K includes both
comprehension and generation aspects of spoken dialogue models. These scenarios
span various tasks, such as text-based chats, instruction chats covering nine
acoustic attributes, and implicit chats. WavReward outperforms previous
state-of-the-art evaluation models across multiple spoken dialogue scenarios,
achieving a substantial improvement over Qwen2.5-Omni in objective accuracy,
from 55.1% to 91.5%. In subjective A/B testing, WavReward also leads by a
margin of 83%. Comprehensive ablation studies confirm the necessity of each
component of WavReward. All data and code will be made publicly available at
https://github.com/jishengpeng/WavReward after the paper is accepted.
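
As a rough illustration of how the objective accuracy reported above can be computed on a preference dataset such as ChatReward-30K, the sketch below scores chosen/rejected response pairs and counts how often the preferred reply receives the higher reward. The `score_response` callable is a hypothetical stand-in for a reward evaluator like WavReward, not the authors' actual API.

```python
from typing import Callable, List, Tuple

# A preference pair: (audio of the preferred reply, audio of the rejected reply).
# Audio is represented abstractly as bytes; a real evaluator consumes speech input.
PreferencePair = Tuple[bytes, bytes]


def pairwise_accuracy(
    pairs: List[PreferencePair],
    score_response: Callable[[bytes], float],  # hypothetical reward interface
) -> float:
    """Fraction of pairs where the evaluator scores the preferred reply higher."""
    if not pairs:
        return 0.0
    correct = sum(
        1
        for chosen, rejected in pairs
        if score_response(chosen) > score_response(rejected)
    )
    return correct / len(pairs)


if __name__ == "__main__":
    # Toy stand-in scorer for demonstration only: longer audio gets a higher score.
    toy_scorer = lambda audio: float(len(audio))
    toy_pairs = [(b"long preferred reply", b"short"), (b"also longer", b"x")]
    print(f"pairwise accuracy: {pairwise_accuracy(toy_pairs, toy_scorer):.2f}")
```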