Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

Abstract

Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical. Existing benchmarks, however, evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN, comprising 1,104 videos and 5,320 multiple-choice questions (MCQs). It is produced by a multi-agent pipeline with human verification and contains timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) together with four sample-level failure-mode metrics, namely Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the evaluated models, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-grounding Rate ranges only from 0% to 33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, and chart a roadmap for grounded social cognition in MLLMs.
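
To make the sample-level idea concrete, the following is a minimal sketch and not the paper's metric definitions. It assumes each sample carries one pass/fail flag per evaluation tier (the `Sample` fields are hypothetical). It computes the share of correct ratings that are not grounded, which is the quantity behind the reported 51% figure, and an assumed "all three tiers pass" rate as a stand-in for a holistic score. The abstract does not give exact formulas for PR, CR, IR, or HR, so none of them is reproduced here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    """Hypothetical per-sample outcome: one pass/fail flag per evaluation tier."""
    rating_correct: bool     # tier 1: the Big Five rating is correct
    reasoning_correct: bool  # tier 2: the reasoning is judged correct
    grounding_correct: bool  # tier 3: the cue-grounding answer is correct


def ungrounded_share_of_correct(samples: Sequence[Sample]) -> float:
    """Fraction of correctly rated samples whose grounding is NOT correct."""
    correct = [s for s in samples if s.rating_correct]
    if not correct:
        return 0.0
    return sum(not s.grounding_correct for s in correct) / len(correct)


def all_tiers_rate(samples: Sequence[Sample]) -> float:
    """Assumed holistic score: fraction of samples passing all three tiers."""
    if not samples:
        return 0.0
    passed = sum(
        s.rating_correct and s.reasoning_correct and s.grounding_correct
        for s in samples
    )
    return passed / len(samples)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from MM-OCEAN.
    toy = [
        Sample(True, True, True),
        Sample(True, False, False),
        Sample(False, True, True),
        Sample(True, True, False),
    ]
    print(f"ungrounded share of correct ratings: {ungrounded_share_of_correct(toy):.3f}")
    print(f"all-tiers rate: {all_tiers_rate(toy):.3f}")
```

Conditioning the first quantity on correct ratings, rather than on all samples, is what separates "right score" from "right reason": a model can score well on ratings while most of those hits lack supporting cues.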