Human Psychometric Questionnaires Mischaracterize Large Language Model Behavior

Abstract

We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predicting large language model (LLM) behavior in everyday user interactions. We analyze eight open-source LLMs by comparing value and personality profiles derived from two methods: Likert self-reports on established questionnaires (PVQ-40/21 and BFI-44/10), and generation probabilities over value-laden responses to everyday user queries. The two profiles diverge substantially. Within-construct item consistency, often cited as evidence of stable LLM dispositions, disappears in the generation probabilities. We attribute this gap to explicit lexical cues in established questionnaire items, which allow models to recognize the target construct and respond in alignment-consistent, socially desirable ways; realistic user queries provide no such cues. In addition, demographic persona prompts shift models' questionnaire responses in ways consistent with real human patterns, but no such shifts appear in the generation probabilities of responses to realistic user queries, indicating that models have limited ability to simulate the behavior of target demographics in real-world user interactions. Overall, our study shows that human psychometric questionnaires are insufficient tools for predicting LLM behavior, and it suggests generation-based profiling as a more accurate measure.
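To make the two quantities in the abstract concrete, the sketch below shows one way they could be computed. It is an illustration under stated assumptions, not the authors' pipeline: the abstract names neither the consistency statistic nor the normalization, so Cronbach's alpha and a softmax over candidate-response log-probabilities are stand-ins chosen here, and the function names `generation_probabilities` and `cronbach_alpha` are hypothetical.

```python
import math
from statistics import pvariance


def generation_probabilities(logprobs):
    """Turn per-candidate log-probabilities into a distribution.

    Assumption: each candidate response to a user query has a total
    log-probability under the model; a softmax over the candidates
    gives their relative generation probabilities.
    """
    peak = max(logprobs)  # subtract the max for numerical stability
    weights = [math.exp(lp - peak) for lp in logprobs]
    total = sum(weights)
    return [w / total for w in weights]


def cronbach_alpha(scores):
    """Cronbach's alpha for one construct.

    scores: list of rows, one row per observation (for example, one
    sampled run or persona), one column per item of the construct.
    Assumption: alpha stands in for "within-construct item
    consistency"; the abstract does not say which statistic is used.
    """
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [
        pvariance([row[i] for row in scores]) for i in range(n_items)
    ]
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total score has zero variance")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Items that move together yield high alpha.
    consistent = [[1, 1, 1], [3, 3, 3], [5, 5, 5]]
    print(cronbach_alpha(consistent))
    # Two equally likely candidate responses.
    print(generation_probabilities([-2.0, -2.0]))
```

Applied to both profile types, the same statistic would be computed once over Likert item scores and once over per-item generation probabilities, so the two consistency values can be compared for each construct.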