HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
June 26, 2025
Authors: Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, Jingren Zhou
cs.AI
Abstract
With the rapid evolution of multimodal large language models, the ability to
deeply understand and interpret human intentions has emerged as a critical
capability, one that demands detailed and thoughtful reasoning. In recent studies,
Reinforcement Learning (RL) has demonstrated potential in enhancing the
reasoning capabilities of Large Language Models (LLMs). Nonetheless, the
challenges associated with adapting RL to multimodal data and formats remain
largely unaddressed. In this paper, we identify two issues in existing
multimodal reasoning models: insufficient global context understanding and
the shortcut problem. Insufficient context understanding arises when a model
misinterprets the multimodal context, resulting in incorrect answers. The shortcut
problem occurs when a model overlooks crucial cues in the multimodal inputs
and answers the query directly, without considering the multimodal information.
To tackle these issues, we emphasize the necessity for the model to reason with
a clear understanding of the global context within the multimodal inputs. Such
global context understanding effectively prevents the model from overlooking
key multimodal cues and ensures a thorough reasoning process. To encourage
accurate interpretation of the multimodal context, we implement a
context reward judged by a large language model, alongside format and accuracy
rewards. Additionally, to improve complex reasoning capability, we employ an
LLM to assess a logic reward, determining whether the reasoning process
successfully integrates multimodal information with sound logic. We also
introduce IntentBench, an omni-modal reasoning benchmark aimed at evaluating
models' understanding of complex human intentions and emotions. Our proposed
method demonstrates superior performance across multiple omni-modal benchmarks
compared with other open-source omni-modal models.
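
The abstract names four reward signals (format, accuracy, an LLM-judged context reward, and an LLM-judged logic reward) without giving implementation details. Below is a minimal Python sketch of how such a composite reward could be assembled; the `<think>/<answer>` tag format, exact-match answer scoring, rubric wording, helper names, and equal weights are all assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a composite reward like the one described above.
# Assumptions (not from the paper): a <think>/<answer> output format,
# exact-match answer scoring, a callable judge LLM, and equal weights.
import re

def format_reward(response: str) -> float:
    """1.0 if the response wraps reasoning and answer in the expected tags."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the reference answer exactly."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

def llm_judged_reward(judge_llm, rubric: str, context_summary: str,
                      reasoning: str) -> float:
    """Score the reasoning against a rubric with a judge LLM (expects 0-1).

    The context reward asks whether the reasoning reflects an accurate
    reading of the multimodal context; the logic reward asks whether it
    combines multimodal cues with sound logical steps.
    """
    prompt = (f"{rubric}\n\nContext: {context_summary}\n"
              f"Reasoning: {reasoning}\n\nReply with a score from 0 to 1.")
    return float(judge_llm(prompt))

def total_reward(response: str, ground_truth: str, judge_llm,
                 context_summary: str, weights=(1.0, 1.0, 1.0, 1.0)) -> float:
    """Weighted sum of format, accuracy, context, and logic rewards."""
    match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    reasoning = match.group(1).strip() if match else response
    r_ctx = llm_judged_reward(
        judge_llm,
        "Does the reasoning accurately interpret the multimodal context?",
        context_summary, reasoning)
    r_logic = llm_judged_reward(
        judge_llm,
        "Does the reasoning integrate multimodal cues with valid logic?",
        context_summary, reasoning)
    return (weights[0] * format_reward(response)
            + weights[1] * accuracy_reward(response, ground_truth)
            + weights[2] * r_ctx
            + weights[3] * r_logic)
```

In an RL loop (for example, a GRPO-style setup), such a scalar would score each sampled rollout before the policy update; the paper's actual reward formulation and weighting are not specified in the abstract.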