Human Psychometric Questionnaires Mischaracterize LLM Behavior

Abstract

We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predicting LLM behavior in everyday user interactions. We analyze eight open-source LLMs by comparing value and personality profiles derived from two methods: Likert self-reports on established questionnaires (PVQ-40/21 and BFI-44/10) and generation probabilities over value-laden responses to everyday user queries. The two profiles diverge substantially. Within-construct item consistency, often cited as evidence of stable LLM dispositions, disappears in generation probabilities. We attribute this gap to explicit lexical cues in established questionnaire items, which allow models to recognize the target construct and respond in alignment-consistent, socially desirable ways; realistic user queries provide no such cues. In addition, demographic persona prompts shift models' responses to human questionnaires in ways consistent with real human patterns, but no such shifts appear in the generation probabilities of responses to realistic user queries, indicating that persona prompting has limited ability to simulate the behavior of target demographics in real-world user interactions. Overall, our study shows that human psychometric questionnaires are insufficient tools for predicting LLM behavior and suggests generation-based profiling as a more accurate measure.