Rethinking Psychometric Evaluation of Large Language Models: When and Why Self-Reports Predict Behavior

Abstract

Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in LLMs, but relied on broad personality traits (Big 5) that predict specific behaviors only weakly, even in humans. Furthermore, isolated conversational sessions combined with weak context matching left open whether LLMs truly lack coherence or whether the conditions needed to detect it were not met. We contrast Big 5 with the Theory of Planned Behavior (TPB), which measures intention toward a specific behavior and predicts human behavior substantially better than broad traits. We run experiments across four behavioral tasks and 11 frontier LLMs, varying both session context and identity induction. We find that SR-behavior coherence exists but is selective: 1) Within a shared conversation, TPB reaches human-level coherence; Big 5 does not. 2) Across separate conversations, coherence survives only for behaviors anchored outside the immediate prompt, such as implicit bias shaped by training, and collapses when behavior is strongly primed by context, as with sycophancy. 3) Persona prompting makes self-reports more consistent across conversations, but does not bring behavior into alignment with them. These findings suggest that coarse personality frameworks, such as Big 5, may not be the best tools for testing deployment behavior. More task- and behavior-specific instruments are needed, and even these must be evaluated across tasks and contexts.