HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
June 26, 2025
Authors: Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, Jingren Zhou
cs.AI
Abstract
With the rapid evolution of multimodal large language models, the ability to
deeply understand and interpret human intentions has emerged as a critical
capability, one that demands detailed and thoughtful reasoning. In recent studies,
Reinforcement Learning (RL) has demonstrated potential in enhancing the
reasoning capabilities of Large Language Models (LLMs). Nonetheless, the
challenges associated with adapting RL to multimodal data and formats remain
largely unaddressed. In this paper, we identify two issues in existing
multimodal reasoning models: insufficient global context understanding and
the shortcut problem. Insufficient context understanding arises when a model
misinterprets the multimodal context, resulting in incorrect answers. The shortcut
problem occurs when a model overlooks crucial cues in the multimodal inputs
and answers the query directly, without considering the multimodal information.
To tackle these issues, we emphasize the necessity for the model to reason with
a clear understanding of the global context within the multimodal inputs. Such
global context understanding effectively prevents the model from overlooking
key multimodal cues and ensures a thorough reasoning process. To encourage
accurate interpretation of the multimodal context, we implement a
context reward judged by a large language model, alongside format and accuracy
rewards. Additionally, to improve complex reasoning capability, we employ an
LLM to assess a logic reward, determining whether the reasoning process
successfully integrates multimodal information with sound logic. We also
introduce IntentBench, an omni-modal reasoning benchmark aimed at evaluating
models' understanding of complex human intentions and emotions. Our proposed
method demonstrates superior performance across multiple omni-modal benchmarks
compared with other open-source omni-modal models.
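
The abstract names four reward signals (format, accuracy, an LLM-judged context reward, and an LLM-judged logic reward) without giving implementation details. Below is a minimal Python sketch of how such a composite reward could be assembled; the `<think>/<answer>` tag format, exact-match answer scoring, rubric wording, helper names, and equal weights are all assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a composite reward like the one described above.
# Assumptions (not from the paper): a <think>/<answer> output format,
# exact-match answer scoring, a callable judge LLM, and equal weights.
import re

def format_reward(response: str) -> float:
    """1.0 if the response wraps reasoning and answer in the expected tags."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the reference answer exactly."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

def llm_judged_reward(judge_llm, rubric: str, context_summary: str,
                      reasoning: str) -> float:
    """Score the reasoning against a rubric with a judge LLM (expects 0-1).

    The context reward asks whether the reasoning reflects an accurate
    reading of the multimodal context; the logic reward asks whether it
    combines multimodal cues with sound logical steps.
    """
    prompt = (f"{rubric}\n\nContext: {context_summary}\n"
              f"Reasoning: {reasoning}\n\nReply with a score from 0 to 1.")
    return float(judge_llm(prompt))

def total_reward(response: str, ground_truth: str, judge_llm,
                 context_summary: str, weights=(1.0, 1.0, 1.0, 1.0)) -> float:
    """Weighted sum of format, accuracy, context, and logic rewards."""
    match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    reasoning = match.group(1).strip() if match else response
    r_ctx = llm_judged_reward(
        judge_llm,
        "Does the reasoning accurately interpret the multimodal context?",
        context_summary, reasoning)
    r_logic = llm_judged_reward(
        judge_llm,
        "Does the reasoning integrate multimodal cues with valid logic?",
        context_summary, reasoning)
    return (weights[0] * format_reward(response)
            + weights[1] * accuracy_reward(response, ground_truth)
            + weights[2] * r_ctx
            + weights[3] * r_logic)
```

In an RL loop (for example, a GRPO-style setup), such a scalar would score each sampled rollout before the policy update; the paper's actual reward formulation and weighting are not specified in the abstract.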