Advancing Speech Understanding in Speech-Aware Language Models with GRPO

Abstract

In this paper, we introduce a Group Relative Policy Optimization (GRPO)-based method for training Speech-Aware Large Language Models (SALLMs) on open-format speech understanding tasks, such as Spoken Question Answering and Automatic Speech Translation. SALLMs have proven highly effective for speech understanding tasks. GRPO has recently gained traction for its efficiency in training Large Language Models (LLMs), and prior work has explored its application to SALLMs, primarily in multiple-choice tasks. Building on this, we focus on open-format tasks that better reflect the generative abilities of the models. Our approach uses GRPO with BLEU as the reward signal to optimize SALLMs, and we demonstrate empirically that it surpasses standard Supervised Fine-Tuning (SFT) across several key metrics. Finally, we explore the potential of incorporating off-policy samples within GRPO for these tasks, highlighting avenues for further improvement and research.
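
As a rough illustration of how a BLEU reward could enter GRPO, the sketch below turns per-response BLEU scores for one prompt into group-relative advantages. The abstract does not give the exact formulation, so this assumes the commonly described GRPO normalization (subtract the group mean, divide by the group standard deviation). The function name `group_relative_advantages`, the `eps` term, and the example scores are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps), the usual
    GRPO-style normalization; the paper's exact variant may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical sentence-level BLEU scores (scaled to [0, 1]) for four
# sampled translations of the same spoken input.
bleu_scores = [0.10, 0.30, 0.50, 0.30]
advantages = group_relative_advantages(bleu_scores)
```

Responses that score above the group average receive a positive advantage and are reinforced, while those below it are penalized. If every response in a group gets the same BLEU score, all advantages are zero and that group contributes no learning signal.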
