Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

Summary

We propose Omni-R1, which fine-tunes a recent multimodal LLM, Qwen2.5-Omni, on an audio question answering dataset using the reinforcement learning method GRPO. This yields new state-of-the-art performance on the recent MMAU benchmark. Omni-R1 achieves the highest accuracies in the sounds, music, and speech categories and in the overall average, on both the Test-mini and Test-full splits. To understand the performance improvement, we tested models both with and without audio and found that much of the improvement from GRPO could be attributed to better text-based reasoning. We also made the surprising discovery that fine-tuning without audio, on a text-only dataset, was effective at improving audio-based performance.
