
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning

March 7, 2025
Authors: Jiaxing Zhao, Xihan Wei, Liefeng Bo
cs.AI

Abstract

In this work, we present the first application of Reinforcement Learning with Verifiable Reward (RLVR) to an Omni-multimodal large language model in the context of emotion recognition, a task where both visual and audio modalities play crucial roles. We leverage RLVR to optimize the Omni model, significantly enhancing its performance in three key aspects: reasoning capability, emotion recognition accuracy, and generalization ability. The introduction of RLVR not only improves the model's overall performance on in-distribution data but also demonstrates superior robustness when evaluated on out-of-distribution datasets. More importantly, the improved reasoning capability enables clear analysis of the contributions of different modalities, particularly visual and audio information, in the emotion recognition process. This provides valuable insights into the optimization of multimodal large language models.
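The abstract does not specify how the verifiable reward is computed. As an illustration of the general RLVR idea applied to emotion recognition, the sketch below assumes a simple rule-based reward combining a format check on the model's response with an exact-match check on the predicted emotion label. The function name, tag format, and label set are illustrative assumptions, not the authors' implementation.

```python
import re

# Emotion labels assumed for illustration; the actual label set used by
# R1-Omni is not specified in this abstract.
EMOTIONS = {"happy", "sad", "angry", "neutral", "surprised", "fearful", "disgusted"}

def compute_rlvr_reward(response: str, gold_label: str) -> float:
    """Rule-based (verifiable) reward for a single emotion-recognition rollout.

    The reward is the sum of:
      * a format term: 1.0 if the response wraps its reasoning and final answer
        in <think>...</think> and <answer>...</answer> tags, else 0.0;
      * an accuracy term: 1.0 if the extracted answer exactly matches the
        ground-truth emotion label, else 0.0.
    """
    format_ok = bool(
        re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", response, re.DOTALL)
    )
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    predicted = match.group(1).strip().lower() if match else ""
    accuracy = 1.0 if predicted == gold_label.lower() and predicted in EMOTIONS else 0.0
    return float(format_ok) + accuracy

# Example rollout scored during RL fine-tuning:
rollout = "<think>The raised voice and frown suggest anger.</think><answer>angry</answer>"
print(compute_rlvr_reward(rollout, "angry"))  # -> 2.0
```

Because the reward is computed deterministically from the response text and the ground-truth label, it requires no learned reward model, which is the property that makes the reward "verifiable" in the RLVR setting.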
