Psychometric Questionnaires Designed for Humans Mischaracterize LLM Behavior

Abstract

We examine whether human psychometric questionnaires are reliable tools for characterizing and predicting LLM behavior in everyday user interactions. We analyze eight open-source LLMs and compare the value and personality profiles obtained from two methods: Likert-scale self-reports on established questionnaires (PVQ-40/21 and BFI-44/10), and generation probabilities of value-laden responses to everyday user queries. The two profiles diverge substantially. Within-construct item consistency, often cited as evidence of stable LLM dispositions, disappears in generation probabilities. We attribute this gap to explicit lexical cues in established questionnaire items, which allow models to recognize the target construct and respond in alignment-consistent, socially desirable ways; realistic user queries provide no such cues. In addition, demographic persona prompts shift models' questionnaire responses in ways consistent with real human patterns, but no such shifts appear in the generation probabilities of responses to realistic user queries, indicating that the models have limited ability to simulate the behavior of target demographics in real-world user interactions. Overall, our study shows that human psychometric questionnaires are insufficient tools for predicting LLM behavior, and it suggests generation-based profiling as a more accurate measure.
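
The two quantities the abstract contrasts, within-construct item consistency and agreement between the two profiles, can be illustrated with a minimal sketch. It assumes Cronbach's alpha as the consistency statistic and Pearson correlation as the profile-agreement statistic; the abstract does not name the statistics actually used, so both choices, the function names, and the data layout are illustrative assumptions rather than the authors' procedure.

```python
from statistics import mean, variance


def cronbach_alpha(scores):
    """Cronbach's alpha for one construct (assumed consistency measure).

    scores[r][i] is the score of item i for observation r, where an
    observation might be a model or a prompt paraphrase (hypothetical
    layout). Needs at least two items and two observations, and the
    row totals must not all be equal.
    """
    k = len(scores[0])
    # Sample variance of each item across observations.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the per-observation total score.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson(x, y):
    """Pearson correlation between two profiles of equal length."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```

Under these assumptions, one would compute `cronbach_alpha` separately on the Likert self-report scores and on the generation-probability scores for the items of the same construct, then use `pearson` to compare the per-construct profiles produced by the two methods. High alpha on the questionnaire side together with low alpha and weak profile correlation on the generation side would correspond to the divergence the abstract describes.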