HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs

Abstract

Multimodal Large Language Models (MLLMs) show immense promise for achieving truly human-like interaction, but progress is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, which must cover both the understanding of complex human intentions and the provision of empathetic, context-aware responses. We introduce HumanSense, a comprehensive benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly on advanced interaction-oriented tasks. Supplementing visual input with audio and text information yields substantial improvements, and Omni-modal models show advantages on these tasks. Furthermore, we argue that appropriate feedback stems from a contextual analysis of the interlocutor's needs and emotions, with reasoning ability serving as the key to unlocking it. Accordingly, we employ multi-stage, modality-progressive reinforcement learning to enhance the reasoning abilities of an Omni model, achieving substantial gains on the evaluation results. Additionally, we observe that successful reasoning processes exhibit highly consistent thought patterns. By designing corresponding prompts, we also enhance the performance of non-reasoning models in a training-free manner. Project page: https://digital-avatar.github.io/ai/HumanSense/
