Re-centering Humans in LLM Personalization

Abstract

Despite growing interest, most evaluations of large language models' (LLMs') personalization abilities have relied on synthetic data, so it remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance between synthetic and human data. We collect 550 human conversations, together with human judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919), and incorporating relevant attributes into a personalized response (1,101). Incorporating human data reveals system limitations at each stage. Models struggle to extract attributes from human conversations, disagree with human judgments about which attributes are relevant, and generate personalized responses that humans judge to be no better than generic responses (though LLM judges widely rate them as better). For the first two stages, we introduce two lightweight training-based interventions that bring automated personalization evaluation closer to human data. In the third stage, however, we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned judgments of personalization quality are difficult to model directly. Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.