WavReward: Spoken Dialogue Models With Generalist Reward Evaluators

Abstract

End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered significant attention in the speech domain. However, the evaluation of spoken dialogue models' conversational performance has largely been overlooked. This is primarily because intelligent chatbots convey a wealth of non-textual information that cannot be easily measured using text-based language models like ChatGPT. To address this gap, we propose WavReward, a reward feedback model based on audio language models that can evaluate both the IQ and EQ of spoken dialogue systems with speech input. Specifically, 1) based on audio language models, WavReward incorporates a deep reasoning process and a nonlinear reward mechanism for post-training. By utilizing multi-sample feedback via a reinforcement learning algorithm, we construct a specialized evaluator tailored to spoken dialogue models. 2) We introduce ChatReward-30K, a preference dataset used to train WavReward. ChatReward-30K covers both the comprehension and generation aspects of spoken dialogue models. Its scenarios span various tasks, such as text-based chats, nine acoustic attributes of instruction chats, and implicit chats. WavReward outperforms previous state-of-the-art evaluation models across multiple spoken dialogue scenarios, achieving a substantial improvement over Qwen2.5-Omni in objective accuracy, from 55.1% to 91.5%. In subjective A/B testing, WavReward also leads by a margin of 83%. Comprehensive ablation studies confirm the necessity of each component of WavReward. All data and code will be made publicly available at https://github.com/jishengpeng/WavReward after the paper is accepted.
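The abstract names a nonlinear reward mechanism and multi-sample feedback but does not give their form. The following is only a hypothetical sketch of how such a pairing could look: an evaluator samples several scores for one dialogue, each score is rewarded by an exponentially decaying function of its distance from a reference score, and the rewards are normalized within the sample group. The function names, the exponential shape, the `scale` parameter, and the group normalization are all assumptions made for illustration, not the authors' method.

```python
import math
from statistics import mean, pstdev


def nonlinear_reward(predicted: float, target: float, scale: float = 1.0) -> float:
    """Hypothetical reward: 1.0 for an exact match, decaying exponentially with error."""
    return math.exp(-abs(predicted - target) / scale)


def group_advantages(rewards: list[float]) -> list[float]:
    """Center and scale rewards within one group of samples drawn for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All samples earned the same reward, so none is preferred over another.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Illustrative values only: four sampled scores for one dialogue, reference score 5.
sampled_scores = [5.0, 4.0, 2.0, 5.0]
target_score = 5.0

rewards = [nonlinear_reward(s, target_score) for s in sampled_scores]
advantages = group_advantages(rewards)
```

Under these assumptions, samples that match the reference score receive positive advantages and distant samples receive negative ones, which is the kind of signal a policy-gradient update could consume.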
