Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

Abstract

Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work has documented substantial SR-behavior dissociation in LLMs, but it relied on broad personality traits (Big 5) that predict specific behaviors only weakly, even in humans. Furthermore, conversational sessions were run in isolation and contexts were weakly matched, leaving open whether LLMs truly lack coherence or whether the conditions needed to detect it were simply not met. We contrast Big 5 with the Theory of Planned Behavior (TPB), which measures intention targeted at a specific behavior and predicts human behavior substantially better than broad traits. We run experiments across four behavioral tasks and 11 frontier LLMs, while also varying session context and identity induction. We find that SR-behavior coherence exists but is selective. 1) Within a shared conversation, TPB reaches human-level coherence; Big 5 does not. 2) Across separate conversations, coherence survives only for behaviors anchored outside the immediate prompt, such as implicit bias shaped by training, and collapses when behavior is strongly primed by context, as with sycophancy. 3) Persona prompting makes self-reports more consistent across conversations but does not bring behavior into alignment. These findings suggest that coarse personality frameworks, such as Big 5, may not be the best tools for testing deployment behavior. More task- and behavior-specific instruments are needed, and even these must be evaluated across tasks and contexts.