Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

May 21, 2026
Authors: Caixin Kang, Tianyu Yan, Sitong Gong, Mingfang Zhang, Liangyang Ouyang, Ruicong Liu, Bo Zheng, Huchuan Lu, Kaipeng Zhang, Yoichi Sato, Yifei Huang
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction. This leaves open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 multiple-choice questions (MCQs)), produced by a multi-agent pipeline with human verification and containing timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) plus four sample-level failure-mode metrics: Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR). We benchmark 27 MLLMs (13 closed-source, 14 open-source). The analysis uncovers a striking Prejudice Gap: overall, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-grounding Rate ranges only from 0% to 33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.
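The abstract names four sample-level metrics but does not give their formulas. As an illustration only, the sketch below shows one hypothetical reading of two of them. It assumes each sample is reduced to three pass/fail flags, one per evaluation tier. It treats PR as the share of correctly rated samples whose grounding failed, and HR as the share of all samples that pass every tier. The names `SampleResult`, `prejudice_rate`, and `holistic_grounding_rate` are invented here, and the paper's actual definitions may differ. CR and IR are omitted because the abstract gives too little detail to sketch them.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SampleResult:
    """Hypothetical per-sample outcome: one pass/fail flag per evaluation tier."""
    rating_ok: bool      # predicted trait rating judged correct
    reasoning_ok: bool   # free-text reasoning judged acceptable
    grounding_ok: bool   # cue-grounding MCQ answered correctly


def prejudice_rate(results: Sequence[SampleResult]) -> float:
    """Assumed reading of PR: among correctly rated samples, the fraction not grounded."""
    correct = [r for r in results if r.rating_ok]
    if not correct:
        return 0.0
    return sum(not r.grounding_ok for r in correct) / len(correct)


def holistic_grounding_rate(results: Sequence[SampleResult]) -> float:
    """Assumed reading of HR: fraction of all samples that pass all three tiers."""
    if not results:
        return 0.0
    passed = sum(r.rating_ok and r.reasoning_ok and r.grounding_ok for r in results)
    return passed / len(results)


if __name__ == "__main__":
    # Toy data, not taken from the paper.
    demo = [
        SampleResult(True, True, True),
        SampleResult(True, True, False),
        SampleResult(True, False, False),
        SampleResult(False, True, True),
    ]
    print(f"PR = {prejudice_rate(demo):.3f}")
    print(f"HR = {holistic_grounding_rate(demo):.3f}")
```

Under this reading, a model can score well on ratings while PR stays high, which is the kind of gap between the right score and the right reason that the abstract describes.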