
Multi-Domain Explainability of Preferences

May 26, 2025
Authors: Nitay Calderon, Liat Ein-Dor, Roi Reichart
cs.AI

Abstract

Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and reward models, are central to aligning and evaluating large language models (LLMs). Yet, the underlying concepts that drive these preferences remain poorly understood. In this work, we propose a fully automated method for generating local and global concept-based explanations of preferences across multiple domains. Our method utilizes an LLM to identify concepts that distinguish between chosen and rejected responses, and to represent them with concept-based vectors. To model the relationships between concepts and preferences, we propose a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects. To evaluate our method, we curate a dataset spanning eight challenging and diverse domains and explain twelve mechanisms. Our method achieves strong preference prediction performance, outperforming baselines while also being explainable. Additionally, we assess explanations in two application-driven settings. First, guiding LLM outputs with concepts from LaaJ explanations yields responses that those judges consistently prefer. Second, prompting LaaJs with concepts explaining humans improves their preference predictions. Together, our work establishes a new paradigm for explainability in the era of LLMs.
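To make the modeling idea concrete, below is a minimal sketch (not the authors' implementation) of a hierarchical multi-domain logistic regression over concept-based vectors. The assumptions are: each example is a concept-score difference vector x = scores(chosen) - scores(rejected) together with a domain id, and the preference logit is the sum of a domain-general weight vector and a domain-specific one, which is what lets the white-box model separate shared from per-domain effects.

```python
# Hedged sketch of a hierarchical multi-domain logistic regression.
# Assumed inputs: X holds concept-score differences (chosen minus rejected),
# `domains` holds integer domain ids, y holds 1 if the "chosen" response won.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_hierarchical(X, domains, y, n_domains, lr=0.1, l2=1e-2, epochs=500):
    """X: (n, k) concept-difference vectors; domains: (n,) ints; y: (n,) in {0, 1}."""
    n, k = X.shape
    w_gen = np.zeros(k)                  # domain-general concept effects
    w_dom = np.zeros((n_domains, k))     # domain-specific deviations
    for _ in range(epochs):
        # Logit = shared effect + the deviation of the example's own domain.
        logits = X @ w_gen + np.einsum("ij,ij->i", X, w_dom[domains])
        err = sigmoid(logits) - y        # gradient of the logistic loss
        w_gen -= lr * (X.T @ err / n + l2 * w_gen)
        for d in range(n_domains):
            m = domains == d
            if m.any():
                w_dom[d] -= lr * (X[m].T @ err[m] / m.sum() + l2 * w_dom[d])
    return w_gen, w_dom

# Toy usage: 3 concepts, 2 domains, synthetic labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
domains = rng.integers(0, 2, size=200)
y = (X[:, 0] + 0.5 * (domains == 1) * X[:, 1] > 0).astype(float)
w_gen, w_dom = fit_hierarchical(X, domains, y, n_domains=2)
print("domain-general weights:", w_gen.round(2))
print("domain-specific weights:", w_dom.round(2))
```

Because both weight matrices act directly on named concepts, the fitted coefficients can be read off as global explanations: w_gen indicates which concepts drive preferences across all domains, while each row of w_dom indicates how a particular domain departs from that shared pattern.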
