Personalized Reasoning: Just-In-Time Personalization and Why LLMs Fail At It

Abstract

Current large language model (LLM) development treats task-solving and preference alignment as separate challenges, optimizing first for objective correctness and then for alignment to aggregated human preferences. This paradigm fails in human-facing applications, where solving a problem correctly is insufficient if the response mismatches the user's needs. The challenge intensifies in just-in-time scenarios where no prior user interaction history exists because of cold-start conditions or privacy constraints. LLMs need to identify what they do not know about user preferences, strategically elicit preference values through questioning, and then adapt their reasoning processes and responses accordingly. We term this complicated chain of cognitive processes personalized reasoning. We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically grounded personas with sparse preferences. Our framework creates scenarios where identical questions require different reasoning chains depending on user context, as optimal explanation approaches vary by individual expertise and preferences while maintaining factual accuracy. Evaluation of 21 frontier models across 10 tasks reveals that 29.0% of naive personalization attempts produce worse preference alignment than generic responses, yet generic responses also fail to serve individual user needs effectively. These findings suggest that personalized reasoning requires dedicated development rather than emerging naturally. PREFDISCO establishes personalized reasoning as a measurable research frontier and reveals fundamental limitations in current LLMs' interactive capabilities, providing a foundation for developing systems that can adapt to individual users in education, healthcare, and technical domains where personalization is critical.

