
Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

May 14, 2025
作者: Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass
cs.AI

Abstract

We propose Omni-R1, which fine-tunes a recent multi-modal LLM, Qwen2.5-Omni, on an audio question answering dataset with the reinforcement learning method GRPO. This yields new state-of-the-art performance on the recent MMAU benchmark. Omni-R1 achieves the highest accuracies on the sounds, music, speech, and overall average categories, on both the Test-mini and Test-full splits. To understand the performance improvement, we tested models both with and without audio and found that much of the gain from GRPO could be attributed to better text-based reasoning. Surprisingly, we also found that fine-tuning on a text-only dataset, with no audio at all, was effective at improving audio-based performance.
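The abstract names GRPO (Group Relative Policy Optimization) as the fine-tuning method but does not spell it out. Below is a minimal PyTorch sketch of the group-relative objective at GRPO's core, not the paper's actual training code: the function name grpo_loss, the binary correctness reward, and the hyperparameter values (clip_eps, kl_coef, group size G = 4) are illustrative assumptions.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_coef=0.04):
    # Group-relative advantage: standardize rewards across the G answers
    # sampled for the same question, so no learned critic is needed.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped surrogate on the importance ratio between the
    # current policy and the policy that generated the samples.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)

    # Unbiased KL estimator used in the original GRPO formulation, keeping
    # the fine-tuned policy close to a frozen reference model.
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    return (-surrogate + kl_coef * kl).mean()

# Toy usage: G = 4 answers sampled for one audio QA item, rewarded 1.0
# when the answer matches the ground-truth choice and 0.0 otherwise.
logp_old = torch.randn(4)                          # log-probs under the sampling policy
logp_new = (logp_old + 0.05 * torch.randn(4)).requires_grad_()
logp_ref = logp_old.clone()                        # frozen reference policy
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
loss = grpo_loss(logp_new, logp_old, logp_ref, rewards)
loss.backward()
```

Because the advantage is computed relative to other answers in the same group, a simple verifiable reward such as answer correctness suffices, which is what makes GRPO a natural fit for multiple-choice audio QA benchmarks like MMAU.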
