Offline Evaluation Measures of Fairness in Recommender Systems
April 27, 2026
Author: Theresia Veronika Rampisela
cs.AI
Abstract
The evaluation of recommender system fairness has become increasingly important, especially with recent legislation that emphasises the development of fair and responsible artificial intelligence. This has led to the emergence of various fairness evaluation measures, which quantify fairness based on different definitions. However, many such measures are simply proposed and used without further analysis of their robustness. As a result, there is insufficient understanding and awareness of the measures' limitations. Among other issues, it is not known what kind of model outputs produce the fairest or unfairest scores, how the measure scores are empirically distributed, and whether there are cases where the measures cannot be computed (e.g., due to division by zero). These issues make the measure scores difficult to interpret and cause confusion about which measure(s) should be used in a specific case.
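To make the division-by-zero failure mode concrete, the following is a minimal illustrative sketch (not a measure from the thesis): a hypothetical exposure-ratio measure that compares how often items from two groups appear in a recommendation list, and that becomes uncomputable whenever one group receives no exposure at all. The function name and group setup are assumptions for illustration only.

```python
# Illustrative sketch (not from the thesis): a hypothetical group-fairness
# measure over item exposure, showing the division-by-zero failure mode
# described above.

def exposure_ratio(recommended_items, group_a, group_b):
    """Ratio of group A's exposure to group B's in a recommendation list.

    A score of 1.0 indicates equal exposure; the measure is undefined
    (division by zero) whenever group B receives no exposure at all.
    """
    exp_a = sum(1 for item in recommended_items if item in group_a)
    exp_b = sum(1 for item in recommended_items if item in group_b)
    if exp_b == 0:
        raise ZeroDivisionError("group B has zero exposure; score undefined")
    return exp_a / exp_b

# Equal exposure across groups -> 1.0, the "fairest" output of this measure
print(exposure_ratio(["i1", "i2"], group_a={"i1"}, group_b={"i2"}))

# Group B is never recommended -> the measure simply cannot be computed
try:
    exposure_ratio(["i1", "i3"], group_a={"i1", "i3"}, group_b={"i2"})
except ZeroDivisionError as err:
    print("undefined:", err)
```

A robust measure would need a documented convention for this edge case (e.g., a defined limit value or an explicit "not applicable" result) rather than leaving the behaviour unspecified.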
This thesis presents a series of papers that assess and overcome various theoretical, empirical, and conceptual limitations of existing recommender system fairness evaluation measures. We investigate a wide range of offline evaluation measures for different fairness notions, divided by evaluation subject (users or items) and by evaluation granularity (groups of subjects or individual subjects). Firstly, we perform theoretical and empirical analyses of the measures, exposing flaws that limit their interpretability, expressiveness, or applicability. Secondly, we contribute novel evaluation approaches and measures that overcome these limitations. Finally, in light of the measures' limitations, we recommend guidelines for appropriate measure usage, allowing for more precise selection of fairness evaluation measures in practical scenarios.
Overall, this thesis contributes to advancing the state of the art in the offline evaluation of fairness in recommender systems.