Offline Evaluation Measures of Fairness in Recommender Systems

Abstract

The evaluation of recommender system fairness has become increasingly important, especially with recent legislation that emphasises the development of fair and responsible artificial intelligence. This has led to the emergence of various fairness evaluation measures, which quantify fairness based on different definitions. However, many such measures are simply proposed and then used without further analysis of their robustness. As a result, there is insufficient understanding and awareness of the measures' limitations. Among other issues, it is not known what kind of model outputs produce the (un)fairest score, how the measure scores are empirically distributed, and whether there are cases where the measures cannot be computed (e.g., due to division by zero). These issues make the measure scores difficult to interpret and cause confusion about which measure(s) should be used in a specific case. This thesis presents a series of papers that assess and overcome various theoretical, empirical, and conceptual limitations of existing recommender system fairness evaluation measures. We investigate a wide range of offline evaluation measures for different fairness notions, categorised by evaluation subject (users and items) and by evaluation granularity (groups of subjects and individual subjects). Firstly, we perform theoretical and empirical analyses of the measures, exposing flaws that limit their interpretability, expressiveness, or applicability. Secondly, we contribute novel evaluation approaches and measures that overcome these limitations. Finally, considering the measures' limitations, we recommend guidelines for appropriate measure usage, thereby allowing for more precise selection of fairness evaluation measures in practical scenarios. Overall, this thesis contributes to advancing the state of the art in offline evaluation of fairness in recommender systems.
