Offline Evaluation Measures of Fairness in Recommender Systems

Abstract

The evaluation of recommender system fairness has become increasingly important, especially as recent legislation emphasises the development of fair and responsible artificial intelligence. This has led to the emergence of various fairness evaluation measures, which quantify fairness based on different definitions. However, many such measures are proposed and used without further analysis of their robustness. As a result, their limitations are insufficiently understood. Among other issues, it is not known what kind of model outputs produce the fairest or most unfair scores, how the measure scores are empirically distributed, and whether there are cases where the measures cannot be computed (e.g., due to division by zero). These issues make the measure scores difficult to interpret and cause confusion about which measure(s) should be used in a specific case. This thesis presents a series of papers that assess and overcome various theoretical, empirical, and conceptual limitations of existing recommender system fairness evaluation measures. We investigate a wide range of offline evaluation measures for different fairness notions, divided by evaluation subject (users and items) and by evaluation granularity (groups of subjects and individual subjects). Firstly, we analyse the measures theoretically and empirically, exposing flaws that limit their interpretability, expressiveness, or applicability. Secondly, we contribute novel evaluation approaches and measures that overcome these limitations. Finally, considering the measures' limitations, we recommend guidelines for their appropriate use, thereby allowing for more precise selection of fairness evaluation measures in practical scenarios. Overall, this thesis advances the state of the art in offline evaluation of fairness in recommender systems.