

Personalized Reasoning: Just-In-Time Personalization and Why LLMs Fail At It

September 30, 2025
作者: Shuyue Stella Li, Avinandan Bose, Faeze Brahman, Simon Shaolei Du, Pang Wei Koh, Maryam Fazel, Yulia Tsvetkov
cs.AI

Abstract

Current large language model (LLM) development treats task-solving and preference alignment as separate challenges, optimizing first for objective correctness, then for alignment to aggregated human preferences. This paradigm fails in human-facing applications where solving a problem correctly is insufficient if the response mismatches the user's needs. This challenge intensifies in just-in-time scenarios where no prior user interaction history exists due to cold-start conditions or privacy constraints. LLMs need to identify what they don't know about user preferences, strategically elicit preference values through questioning, then adapt their reasoning processes and responses accordingly -- a complicated chain of cognitive processes which we term personalized reasoning. We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse preferences. Our framework creates scenarios where identical questions require different reasoning chains depending on user context, as optimal explanation approaches vary by individual expertise and preferences while maintaining factual accuracy. Evaluation of 21 frontier models across 10 tasks reveals 29.0% of naive personalization attempts produce worse preference alignment than generic responses, yet generic responses also fail to serve individual user needs effectively. These findings suggest personalized reasoning requires dedicated development rather than emerging naturally. PREFDISCO establishes personalized reasoning as a measurable research frontier and reveals fundamental limitations in current LLMs' interactive capabilities, providing a foundation for developing systems that can adapt to individual users in education, healthcare, and technical domains where personalization is critical.
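The abstract describes an evaluation loop in which a model faces a static benchmark question paired with a persona whose preferences are sparse, may spend a small budget of clarifying questions to elicit those preferences, and is then scored on how well its response aligns with them. The following is a minimal illustrative sketch of that setup, not the paper's actual framework; all class names, fields, and the scoring rule are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Persona:
    # Sparse preferences: only a few dimensions are specified; unspecified
    # dimensions are irrelevant to this user and excluded from scoring.
    preferences: dict  # e.g. {"detail_level": "high", "tone": "formal"}

@dataclass
class InteractiveTask:
    question: str
    persona: Persona
    max_questions: int = 3  # budget for preference-eliciting questions

    def elicit(self, dimension: str):
        """Simulate the persona answering one clarifying question."""
        return self.persona.preferences.get(dimension)  # None if unspecified

def alignment_score(response_attrs: dict, persona: Persona) -> float:
    """Fraction of the persona's sparse preferences the response matches."""
    prefs = persona.preferences
    if not prefs:
        return 1.0
    matched = sum(1 for k, v in prefs.items() if response_attrs.get(k) == v)
    return matched / len(prefs)

# Usage: a model that asks about "detail_level" and adapts, but guesses tone.
persona = Persona(preferences={"detail_level": "high", "tone": "formal"})
task = InteractiveTask("Explain photosynthesis.", persona)
elicited = task.elicit("detail_level")                  # -> "high"
response_attrs = {"detail_level": elicited, "tone": "neutral"}
print(alignment_score(response_attrs, persona))         # 0.5 (tone mismatched)
```

Under this toy scoring, a generic response that ignores the persona can match some preferences by chance, while a misdirected personalization attempt (e.g. guessing "casual" tone for a formal user) can score worse than generic, mirroring the failure mode the abstract reports.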
PDF (32) · October 6, 2025