Multi-Domain Explainability of Preferences
May 26, 2025
Authors: Nitay Calderon, Liat Ein-Dor, Roi Reichart
cs.AI
Abstract
Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and
reward models, are central to aligning and evaluating large language models
(LLMs). Yet, the underlying concepts that drive these preferences remain poorly
understood. In this work, we propose a fully automated method for generating
local and global concept-based explanations of preferences across multiple
domains. Our method utilizes an LLM to identify concepts that distinguish
between chosen and rejected responses, and to represent them with concept-based
vectors. To model the relationships between concepts and preferences, we
propose a white-box Hierarchical Multi-Domain Regression model that captures
both domain-general and domain-specific effects. To evaluate our method, we
curate a dataset spanning eight challenging and diverse domains and explain
twelve mechanisms. Our method achieves strong preference prediction
performance, outperforming baselines while also being explainable.
Additionally, we assess explanations in two application-driven settings. First,
guiding LLM outputs with concepts from LaaJ explanations yields responses that
those judges consistently prefer. Second, prompting LaaJs with concepts
that explain human preferences improves their preference predictions. Together, our work
establishes a new paradigm for explainability in the era of LLMs.
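
To make the modeling idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of a white-box hierarchical multi-domain logistic regression over concept-based vectors: a shared weight vector captures domain-general concept effects, while per-domain deviation vectors, shrunk toward zero, capture domain-specific effects. All function and parameter names here are illustrative assumptions.

```python
# Hypothetical sketch of a hierarchical multi-domain preference model.
# Each example: a concept vector x (how strongly each concept favors the
# chosen over the rejected response), a domain id d, and a label y in {0, 1}.
# Predicted logit = x . w_g + x . W_d[d] + b, where w_g is domain-general
# and W_d[d] is a domain-specific deviation.
import numpy as np

def fit_hierarchical_preference_model(X, y, domains, n_domains,
                                      lr=0.1, epochs=500,
                                      lam_general=1e-3, lam_specific=1e-1):
    """X: (n, k) concept vectors; y: (n,) labels; domains: (n,) int domain ids."""
    n, k = X.shape
    w_g = np.zeros(k)                 # domain-general concept effects
    W_d = np.zeros((n_domains, k))    # domain-specific deviations
    b = 0.0
    for _ in range(epochs):
        logits = X @ w_g + np.einsum("ij,ij->i", X, W_d[domains]) + b
        p = 1.0 / (1.0 + np.exp(-logits))
        err = p - y                   # gradient of the log-loss w.r.t. the logits
        w_g -= lr * (X.T @ err / n + lam_general * w_g)
        # Stronger shrinkage on the deviations keeps concept effects
        # domain-general unless the data supports a domain-specific correction.
        for d in range(n_domains):
            mask = domains == d
            if mask.any():
                W_d[d] -= lr * (X[mask].T @ err[mask] / n + lam_specific * W_d[d])
        b -= lr * err.mean()
    return w_g, W_d, b
```

Under this framing, the fitted w_g coefficients act as a global explanation of which concepts drive preferences across domains, the rows of W_d expose domain-specific adjustments, and the per-example contributions x * (w_g + W_d[d]) serve as local explanations for individual preference decisions.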