HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs

Abstract

While Multimodal Large Language Models (MLLMs) show immense promise for achieving truly human-like interactions, progress is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, which span both the understanding of complex human intentions and the provision of empathetic, context-aware responses. Here we introduce HumanSense, a comprehensive benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly on advanced interaction-oriented tasks. Supplementing visual input with audio and text information yields substantial improvements, and Omni-modal models show advantages on these tasks. Furthermore, we argue that appropriate feedback stems from a contextual analysis of the interlocutor's needs and emotions, with reasoning ability serving as the key to unlocking it. Accordingly, we employ multi-stage, modality-progressive reinforcement learning to enhance the reasoning abilities of an Omni model, achieving substantial gains on the evaluation results. Additionally, we observe that successful reasoning processes exhibit highly consistent thought patterns. By designing corresponding prompts, we also enhance the performance of non-reasoning models in a training-free manner. Project page: https://digital-avatar.github.io/ai/HumanSense/
