

Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge

August 12, 2025
Authors: Francesco Fabbri, Gustavo Penha, Edoardo D'Amico, Alice Wang, Marco De Nadai, Jackie Doremus, Paul Gigioli, Andreas Damianou, Oskar Stal, Mounia Lalmas
cs.AI

Abstract

Evaluating personalized recommendations remains a central challenge, especially in long-form audio domains like podcasts, where traditional offline metrics suffer from exposure bias and online methods such as A/B testing are costly and operationally constrained. In this paper, we propose a novel framework that leverages Large Language Models (LLMs) as offline judges to assess the quality of podcast recommendations in a scalable and interpretable manner. Our two-stage profile-aware approach first constructs natural-language user profiles distilled from 90 days of listening history. These profiles summarize both topical interests and behavioral patterns, serving as compact, interpretable representations of user preferences. Rather than prompting the LLM with raw data, we use these profiles to provide high-level, semantically rich context, enabling the LLM to reason more effectively about alignment between a user's interests and recommended episodes. This reduces input complexity and improves interpretability. The LLM is then prompted to deliver fine-grained pointwise and pairwise judgments based on the profile-episode match. In a controlled study with 47 participants, our profile-aware judge matched human judgments with high fidelity and outperformed or matched a variant using raw listening histories. The framework enables efficient, profile-aware evaluation for iterative testing and model selection in recommender systems.
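
The abstract describes a two-stage pipeline: first distill a natural-language user profile from roughly 90 days of listening history, then prompt an LLM for pointwise and pairwise judgments of how well recommended episodes match that profile. The sketch below illustrates that flow; the function names, prompt wording, and the generic `call_llm` hook are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a two-stage, profile-aware LLM-as-a-judge pipeline.
# All prompts and names are assumptions for illustration only.
from typing import Callable, List

LLM = Callable[[str], str]  # any text-completion function: prompt -> response


def build_user_profile(listening_history: List[str], call_llm: LLM) -> str:
    """Stage 1: distill a natural-language profile (topical interests and
    behavioral patterns) from ~90 days of listening history."""
    prompt = (
        "Summarize this user's podcast preferences as a short profile.\n"
        "Cover topical interests and behavioral patterns (e.g., preferred "
        "episode length, listening frequency).\n\n"
        "Listening history (last 90 days):\n- " + "\n- ".join(listening_history)
    )
    return call_llm(prompt)


def pointwise_judgment(profile: str, episode: str, call_llm: LLM) -> str:
    """Stage 2a: grade a single recommended episode against the profile."""
    prompt = (
        f"User profile:\n{profile}\n\n"
        f"Recommended episode:\n{episode}\n\n"
        "On a 1-5 scale, how well does this episode match the user's "
        "interests? Answer with the score and a one-sentence rationale."
    )
    return call_llm(prompt)


def pairwise_judgment(profile: str, episode_a: str, episode_b: str, call_llm: LLM) -> str:
    """Stage 2b: choose which of two recommended episodes better fits the profile."""
    prompt = (
        f"User profile:\n{profile}\n\n"
        f"Episode A:\n{episode_a}\n\nEpisode B:\n{episode_b}\n\n"
        "Which episode better matches the user's interests? "
        "Answer 'A' or 'B' with a brief justification."
    )
    return call_llm(prompt)
```

Keeping `call_llm` as a plain callable keeps the sketch model-agnostic: the profile and judgment prompts can be routed to any completion API, and the pointwise and pairwise judges can be compared against the human ratings collected in the study.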