Re-centering Humans in LLM Personalization

Abstract

Despite growing interest, most evaluations of the personalization abilities of large language models (LLMs) have relied on synthetic data, so it remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance between synthetic and human data. We collect human conversations (550 conversations) and human judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919 judgments), and incorporating relevant attributes into a personalized response (1,101 judgments). Incorporating human data reveals system limitations at each stage. Models struggle to extract attributes from human conversations, disagree with human judgments about which attributes are relevant, and generate personalized responses that humans judge to be no better than generic responses (even though LLM judges widely rate them as better). We introduce two lightweight training-based interventions that shift automated personalization evaluation closer to human data in the first two stages. In the third stage, however, we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned judgments of personalization quality are difficult to model directly. Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.