Advancing Speech Understanding in Speech-Aware Language Models with GRPO

Abstract

In this paper, we introduce a Group Relative Policy Optimization (GRPO)-based method for training Speech-Aware Large Language Models (SALLMs) on open-format speech understanding tasks, such as Spoken Question Answering and Automatic Speech Translation. SALLMs have proven highly effective for speech understanding tasks. GRPO has recently gained traction for its efficiency in training LLMs, and prior work has explored its application to SALLMs, primarily on multiple-choice tasks. Building on this, we focus on open-format tasks that better reflect the generative abilities of the models. Our approach uses GRPO with BLEU as the reward signal to optimize SALLMs, and we demonstrate empirically that it surpasses standard supervised fine-tuning (SFT) across several key metrics. Finally, we explore the potential of incorporating off-policy samples within GRPO for these tasks, highlighting avenues for further improvement and research.
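
To make the reward signal concrete, the following is a minimal sketch of how BLEU scores for a group of sampled responses could be turned into group-relative advantages, the quantity GRPO uses to weight each sample. It is an illustration under stated assumptions, not the implementation used in this work: the helper names (`sentence_bleu`, `group_relative_advantages`), the whitespace tokenization, the add-one smoothing, and the example sentences are all hypothetical choices, and the BLEU variant used in the experiments may differ.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Add-one smoothed sentence-level BLEU in [0, 1] (illustrative variant)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngram_counts(hyp, n), ngram_counts(ref, n)
        total = sum(hyp_counts.values())
        if total == 0:
            # Hypothesis is shorter than n tokens: skip this order.
            continue
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        log_precisions.append(math.log((matches + 1) / (total + 1)))
    # Brevity penalty discourages hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / len(log_precisions))


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of responses sampled for one spoken prompt.
reference = "the cat sat on the mat"
group = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "the dog ran away",
]
rewards = [sentence_bleu(h, reference) for h in group]
advantages = group_relative_advantages(rewards)
for text, r, a in zip(group, rewards, advantages):
    print(f"{r:.3f} {a:+.3f} {text}")
```

Because the advantages are normalized within the group, responses that score above the group's mean BLEU receive positive weight and those below receive negative weight, so no separately learned value model is needed to provide a baseline.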
