HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
June 26, 2025
Authors: Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, Jingren Zhou
cs.AI
Abstract
With the rapid evolution of multimodal large language models, the ability to
deeply understand and interpret human intentions has emerged as a critical
capability, one that demands detailed and thoughtful reasoning. In recent studies,
Reinforcement Learning (RL) has demonstrated potential in enhancing the
reasoning capabilities of Large Language Models (LLMs). Nonetheless, the
challenges associated with adapting RL to multimodal data and formats remain
largely unaddressed. In this paper, we identify two issues in existing
multimodal reasoning models: insufficient global context understanding and
shortcut problems. Insufficient global context understanding arises when a model
misinterprets the multimodal context, resulting in incorrect answers. The shortcut
problem occurs when the model overlooks crucial cues in the multimodal inputs
and responds to the query directly, ignoring the multimodal information.
To tackle these issues, we emphasize the necessity for the model to reason with
a clear understanding of the global context within multimodal inputs. This
global context understanding can effectively prevent the model from overlooking
key multimodal cues and ensure a thorough reasoning process. To promote
accurate interpretation of multimodal context information, we implement a
context reward judged by a large language model, alongside format and accuracy
rewards. Additionally, to improve complex reasoning capability, we employ an
LLM to assess a logical reward, which determines whether the reasoning process
successfully integrates multimodal information with logical methods. We also
introduce an omni-modal reasoning benchmark, IntentBench, designed to evaluate
models' understanding of complex human intentions and emotions. Compared with
other open-source omni-modal models, our proposed method achieves superior
performance across multiple omni-modal benchmarks.
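
The abstract names four reward signals used during RL training: format, accuracy, a context reward, and a logical reward, with the latter two judged by an LLM. Below is a minimal, illustrative Python sketch of how such a combined scalar reward might be computed. Everything here is an assumption for illustration: the function names, the <think>/<answer> response template, the judge prompts, and the equal weights are not taken from the paper.

```python
import re

def format_reward(response: str) -> float:
    # 1.0 if the response follows a <think>...</think><answer>...</answer>
    # layout; this particular template is an assumption, not from the paper.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(predicted: str, reference: str) -> float:
    # 1.0 on an exact (case-insensitive) match with the reference answer;
    # real setups often use softer matching or a verifier model.
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0

def llm_judge(prompt: str) -> float:
    # Placeholder for a call to a judge LLM that returns a score in [0, 1].
    # Wire this to an actual model API in practice.
    return 0.5

def context_reward(reasoning: str) -> float:
    # LLM-judged: does the reasoning reflect an accurate understanding of the
    # global multimodal context (audio, video, text) before answering?
    return llm_judge(
        "Score 0-1: does the following reasoning accurately summarize and use "
        f"the global multimodal context?\n{reasoning}"
    )

def logical_reward(reasoning: str) -> float:
    # LLM-judged: does the reasoning integrate the multimodal evidence with
    # logical methods instead of shortcutting straight to an answer?
    return llm_judge(
        "Score 0-1: does the following reasoning combine the multimodal "
        f"evidence logically rather than ignoring it?\n{reasoning}"
    )

def total_reward(response: str, reasoning: str, predicted: str,
                 reference: str, weights=(1.0, 1.0, 1.0, 1.0)) -> float:
    # Weighted sum of the four reward terms; equal weights are an assumption.
    w_fmt, w_acc, w_ctx, w_log = weights
    return (w_fmt * format_reward(response)
            + w_acc * accuracy_reward(predicted, reference)
            + w_ctx * context_reward(reasoning)
            + w_log * logical_reward(reasoning))
```

In an RL fine-tuning loop (e.g., a GRPO-style setup), a scalar such as total_reward would typically score each sampled rollout before the policy update; the two LLM-judged terms are what push the model to ground its reasoning in the full multimodal context rather than shortcutting to an answer.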