Personalized Reasoning: Just-In-Time Personalization and Why LLMs Fail At It

Abstract

Current large language model (LLM) development treats task-solving and preference alignment as separate challenges, optimizing first for objective correctness and then for alignment to aggregated human preferences. This paradigm fails in human-facing applications, where solving a problem correctly is insufficient if the response mismatches the user's needs. The challenge intensifies in just-in-time scenarios where no prior user interaction history exists due to cold-start conditions or privacy constraints. LLMs need to identify what they don't know about user preferences, strategically elicit preference values through questioning, and then adapt their reasoning processes and responses accordingly. We term this complex chain of cognitive processes personalized reasoning. We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically grounded personas with sparse preferences. Our framework creates scenarios where identical questions require different reasoning chains depending on user context, as optimal explanation approaches vary by individual expertise and preferences while maintaining factual accuracy. Evaluation of 21 frontier models across 10 tasks reveals that 29.0% of naive personalization attempts produce worse preference alignment than generic responses, yet generic responses also fail to serve individual user needs effectively. These findings suggest that personalized reasoning requires dedicated development rather than emerging naturally. PREFDISCO establishes personalized reasoning as a measurable research frontier and reveals fundamental limitations in current LLMs' interactive capabilities, providing a foundation for developing systems that can adapt to individual users in education, healthcare, and technical domains where personalization is critical.
