

Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

May 14, 2025
作者: Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass
cs.AI

Abstract

We propose Omni-R1, which fine-tunes a recent multi-modal LLM, Qwen2.5-Omni, on an audio question answering dataset with the reinforcement learning method GRPO. This leads to new state-of-the-art performance on the recent MMAU benchmark: Omni-R1 achieves the highest accuracies on the sounds, music, speech, and overall average categories, on both the Test-mini and Test-full splits. To understand the source of the improvement, we evaluated models both with and without audio and found that much of the gain from GRPO could be attributed to better text-based reasoning. We also made a surprising discovery: fine-tuning without audio, on a text-only dataset, was effective at improving audio-based performance.
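The GRPO method mentioned above scores a group of sampled answers per question and normalizes each answer's reward against its group, so no learned value model is needed. A minimal sketch of that group-relative advantage computation follows; the function name and the binary correctness rewards are illustrative assumptions, not the paper's actual code.

```python
# Sketch of the group-relative advantage at the core of GRPO
# (Group Relative Policy Optimization). Hypothetical helper for
# illustration; reward design here is assumed, not from the paper.
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """For one question, normalize each sampled answer's reward
    against its group: A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: binary correctness rewards for 4 sampled answers to one
# audio question; correct answers receive positive advantage and
# push the policy toward those responses.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Answers that beat their group's mean get a positive advantage, those below it a negative one; the per-group normalization is what lets GRPO train from simple scalar rewards such as answer correctness on an audio QA set.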