Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

Abstract

Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical. Yet existing benchmarks evaluate this capability solely through numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge it through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 multiple-choice questions, or MCQs), produced by a multi-agent pipeline with human verification and comprising timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) together with four sample-level failure-mode metrics, namely Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and we benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the field, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-grounding Rate ranges only from 0% to 33.5%. These findings expose a disconnect between getting the right score and reaching it for the right reasons, and they chart a roadmap for grounded social cognition in MLLMs.