Human Psychometric Questionnaires Misread LLM Behavior

Abstract

We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predicting LLM behavior in everyday user interactions. We analyze eight open-source LLMs by comparing value and personality profiles derived from two methods: Likert self-reports on established questionnaires (PVQ-40/21 and BFI-44/10) and generation probabilities over value-laden responses to everyday user queries. The two profiles diverge substantially. Within-construct item consistency, often cited as evidence of stable LLM dispositions, disappears in generation probabilities. We attribute this gap to explicit lexical cues in established questionnaire items, which allow models to recognize the target construct and respond in alignment-consistent, socially desirable ways, whereas realistic user queries provide no such cues. In addition, demographic persona prompts shift models' questionnaire responses in ways consistent with real human patterns, but no such shifts appear in the generation probabilities of responses to realistic user queries, indicating that models have limited ability to simulate the behavior of target demographics in real-world user interactions. Overall, our study shows that human psychometric questionnaires are insufficient for predicting LLM behavior and suggests generation-based profiling as a more accurate measure.
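
To make "within-construct item consistency" concrete, the following is a minimal sketch of one standard psychometric statistic for it, Cronbach's alpha. The abstract does not say which consistency statistic the study uses, so the choice of alpha, the function name `cronbach_alpha`, and the toy scores are all illustrative assumptions rather than the study's method or data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for the items of a single construct.

    scores: list of rows, one row per respondent (or per model run),
            one column per item belonging to the same construct.
    Illustrative helper (an assumption); the study's statistic may differ.
    """
    if len(scores) < 2:
        raise ValueError("need at least two rows to estimate variances")
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("need at least two items in the construct")

    # Sample variance of each item (column) across rows.
    item_vars = [variance([row[j] for row in scores]) for j in range(n_items)]
    # Sample variance of the per-row total score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total score has zero variance; alpha is undefined")

    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


# Made-up Likert-style scores: items that move together across rows.
consistent = [[1, 1, 1], [3, 3, 3], [5, 5, 5]]
# Made-up scores: items that move in opposite directions across rows.
inconsistent = [[1, 5], [5, 2], [3, 3]]

alpha_consistent = cronbach_alpha(consistent)
alpha_inconsistent = cronbach_alpha(inconsistent)
```

Under this sketch, items that rise and fall together give an alpha near 1, while items that move against each other give a low or negative alpha. The same function could be applied once to questionnaire scores and once to scores derived from generation probabilities to compare the two, assuming both are arranged as rows by items.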