Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge
August 12, 2025
Authors: Francesco Fabbri, Gustavo Penha, Edoardo D'Amico, Alice Wang, Marco De Nadai, Jackie Doremus, Paul Gigioli, Andreas Damianou, Oskar Stal, Mounia Lalmas
cs.AI
Abstract
Evaluating personalized recommendations remains a central challenge,
especially in long-form audio domains like podcasts, where traditional offline
metrics suffer from exposure bias and online methods such as A/B testing are
costly and operationally constrained. In this paper, we propose a novel
framework that leverages Large Language Models (LLMs) as offline judges to
assess the quality of podcast recommendations in a scalable and interpretable
manner. Our two-stage profile-aware approach first constructs natural-language
user profiles distilled from 90 days of listening history. These profiles
summarize both topical interests and behavioral patterns, serving as compact,
interpretable representations of user preferences. Rather than prompting the
LLM with raw data, we use these profiles to provide high-level, semantically
rich context, enabling the LLM to reason more effectively about alignment
between a user's interests and recommended episodes. This reduces input
complexity and improves interpretability. The LLM is then prompted to deliver
fine-grained pointwise and pairwise judgments based on the profile-episode
match. In a controlled study with 47 participants, our profile-aware judge
matched human judgments with high fidelity and outperformed or matched a
variant using raw listening histories. The framework enables efficient,
profile-aware evaluation for iterative testing and model selection in
recommender systems.
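The two-stage pipeline described above can be sketched as prompt construction: a first prompt distills a user profile from listening history, and the profile then grounds pointwise and pairwise judging prompts. This is a minimal illustrative sketch; all function names and prompt wording are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the two-stage profile-aware LLM-as-a-judge pipeline.
# Function names and prompt wording are illustrative assumptions.

def build_profile_prompt(listening_history: list[dict]) -> str:
    """Stage 1: prompt an LLM to distill a natural-language user profile
    from (up to 90 days of) listening history."""
    lines = [
        f"- {e['show']}: \"{e['episode']}\" ({e['minutes_played']} min played)"
        for e in listening_history
    ]
    return (
        "Summarize this user's topical interests and behavioral patterns "
        "as a short natural-language profile.\n\nListening history:\n"
        + "\n".join(lines)
    )

def build_pointwise_prompt(profile: str, episode: str) -> str:
    """Stage 2a: pointwise judgment of the profile-episode match."""
    return (
        f"User profile:\n{profile}\n\nCandidate episode:\n{episode}\n\n"
        "Rate how well this episode matches the user's interests (1-5)."
    )

def build_pairwise_prompt(profile: str, episode_a: str, episode_b: str) -> str:
    """Stage 2b: pairwise judgment between two candidate recommendations."""
    return (
        f"User profile:\n{profile}\n\nEpisode A:\n{episode_a}\n"
        f"Episode B:\n{episode_b}\n\n"
        "Which episode better matches the user's interests? Answer A or B."
    )

# Example usage with a toy history entry.
history = [{"show": "Science Weekly", "episode": "How sleep shapes memory",
            "minutes_played": 42}]
profile_prompt = build_profile_prompt(history)
print(profile_prompt.splitlines()[0])
```

Each returned string would be sent to an LLM; the key design choice from the abstract is that stages 2a/2b see only the compact profile, never the raw history, which shrinks the input and keeps the judgment interpretable.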