Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

Abstract

Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in LLMs, but relied on broad personality traits (the Big 5) that predict specific behaviors only weakly, even in humans. Furthermore, isolated conversational sessions and weak context matching left open whether LLMs truly lack coherence or whether the conditions needed to detect it were simply not met. We contrast the Big 5 with the Theory of Planned Behavior (TPB), which measures intention toward a specific behavior and predicts human behavior substantially better than broad traits. We run experiments across four behavioral tasks and 11 frontier LLMs, varying session context and identity induction. We find that SR-behavior coherence exists but is selective. 1) Within a shared conversation, TPB reaches human-level coherence; the Big 5 does not. 2) Across separate conversations, coherence survives only for behaviors anchored outside the immediate prompt, such as implicit bias shaped by training, and collapses when behavior is strongly primed by context, as with sycophancy. 3) Persona prompting makes self-reports more consistent across conversations, but does not bring behavior into alignment with them. These findings suggest that coarse personality frameworks, such as the Big 5, may not be the best tools for testing deployment behavior. More task- and behavior-specific instruments are needed, and even these must be evaluated across tasks and contexts.