Personalized Reasoning: Just-In-Time Personalization and Why LLMs Fail At It
September 30, 2025
Authors: Shuyue Stella Li, Avinandan Bose, Faeze Brahman, Simon Shaolei Du, Pang Wei Koh, Maryam Fazel, Yulia Tsvetkov
cs.AI
Abstract
Current large language model (LLM) development treats task-solving and
preference alignment as separate challenges, optimizing first for objective
correctness, then for alignment to aggregated human preferences. This paradigm
fails in human-facing applications where solving a problem correctly is
insufficient if the response mismatches the user's needs. This challenge
intensifies in just-in-time scenarios where no prior user interaction history
exists due to cold-start conditions or privacy constraints. LLMs need to
identify what they don't know about user preferences, strategically elicit
preference values through questioning, then adapt their reasoning processes and
responses accordingly -- a complicated chain of cognitive processes which we
term personalized reasoning. We introduce PREFDISCO, an evaluation methodology
that transforms static benchmarks into interactive personalization tasks using
psychologically-grounded personas with sparse preferences. Our framework
creates scenarios where identical questions require different reasoning chains
depending on user context, as optimal explanation approaches vary by individual
expertise and preferences while maintaining factual accuracy. Evaluation of 21
frontier models across 10 tasks reveals 29.0% of naive personalization attempts
produce worse preference alignment than generic responses, yet generic
responses also fail to serve individual user needs effectively. These findings
suggest personalized reasoning requires dedicated development rather than
emerging naturally. PREFDISCO establishes personalized reasoning as a
measurable research frontier and reveals fundamental limitations in current
LLMs' interactive capabilities, providing a foundation for developing systems
that can adapt to individual users in education, healthcare, and technical
domains where personalization is critical.
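The elicit-then-adapt loop the abstract describes (identify unknown preference dimensions, strategically ask about them, then condition the response on the answers) can be sketched in a few lines. This is a minimal, hypothetical illustration of the cold-start interaction pattern, not PREFDISCO's implementation; the dimension names and helper functions below are invented for the example.

```python
# Hypothetical sketch of just-in-time personalization:
# 1) detect gaps in the user model, 2) elicit the missing preference
# values by questioning, 3) adapt a factually fixed answer to them.

PREFERENCE_DIMENSIONS = ["expertise", "verbosity", "tone"]  # illustrative only

def elicit(known: dict, ask_user) -> dict:
    """Ask about each preference dimension the system does not yet know."""
    prefs = dict(known)
    for dim in PREFERENCE_DIMENSIONS:
        if dim not in prefs:                      # gap in the user model
            prefs[dim] = ask_user(f"What {dim} do you prefer?")
    return prefs

def respond(answer: str, prefs: dict) -> str:
    """Adapt the same correct answer to the elicited preferences."""
    if prefs.get("expertise") == "novice":
        answer = "In simple terms: " + answer
    if prefs.get("verbosity") == "brief":
        answer = answer.split(".")[0] + "."
    return answer

# Cold start: no interaction history, so every dimension is elicited.
scripted = iter(["novice", "brief", "friendly"])  # stands in for a real user
prefs = elicit({}, lambda question: next(scripted))
print(respond("Entropy measures uncertainty. It is maximized by "
              "uniform distributions.", prefs))
```

The point of the sketch is the abstract's central claim in miniature: the factual content of `respond` never changes, but the reasoning about *how* to present it depends on preference values the system had to actively elicit first.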