Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?
May 21, 2026
Authors: Caixin Kang, Tianyu Yan, Sitong Gong, Mingfang Zhang, Liangyang Ouyang, Ruicong Liu, Bo Zheng, Huchuan Lu, Kaipeng Zhang, Yoichi Sato, Yifei Huang
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely through numerical Big Five score prediction. This leaves open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 multiple-choice questions, or MCQs), produced by a multi-agent pipeline with human verification and containing timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) plus four sample-level failure-mode metrics, namely Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and benchmark 27 MLLMs (13 closed-source, 14 open-source). The analysis uncovers a striking Prejudice Gap: across all evaluated models, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-grounding Rate ranges only from 0 to 33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.
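To make the sample-level metrics more concrete, the following is a minimal sketch of how two of them might be computed from per-sample outcomes. It is not the authors' implementation. The abstract does not give formal definitions, so the sketch assumes that the Prejudice Rate is the share of correctly rated samples whose grounding check failed (matching the statement that correct ratings are "not grounded in retrieved cues"), and that the Holistic-grounding Rate is the share of all samples that pass every tier. The `SampleResult` type and its field names are hypothetical. CR and IR are omitted because the abstract does not describe them in enough detail.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SampleResult:
    """Hypothetical per-sample outcome for the three evaluation tiers."""
    rating_correct: bool      # rating tier passed (assumed boolean outcome)
    reasoning_correct: bool   # reasoning tier passed (assumed boolean outcome)
    grounding_correct: bool   # grounding tier passed (assumed boolean outcome)


def prejudice_rate(results: Sequence[SampleResult]) -> float:
    """Assumed definition: among correctly rated samples, the fraction
    whose grounding check failed."""
    correct = [r for r in results if r.rating_correct]
    if not correct:
        return 0.0
    ungrounded = sum(1 for r in correct if not r.grounding_correct)
    return ungrounded / len(correct)


def holistic_grounding_rate(results: Sequence[SampleResult]) -> float:
    """Assumed definition: fraction of all samples that pass the rating,
    reasoning, and grounding tiers together."""
    if not results:
        return 0.0
    passed = sum(
        1
        for r in results
        if r.rating_correct and r.reasoning_correct and r.grounding_correct
    )
    return passed / len(results)


# Toy data for illustration only; these are not results from the paper.
toy_results = [
    SampleResult(True, True, True),
    SampleResult(True, True, False),
    SampleResult(True, False, False),
    SampleResult(False, False, True),
]
toy_pr = prejudice_rate(toy_results)
toy_hr = holistic_grounding_rate(toy_results)
```

Under these assumed definitions, a model can score well on ratings alone while still having a high Prejudice Rate and a low Holistic-grounding Rate, which is the kind of gap the abstract reports.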