Multi-Domain Explainability of Preferences
May 26, 2025
Authors: Nitay Calderon, Liat Ein-Dor, Roi Reichart
cs.AI
Abstract
Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and
reward models, are central to aligning and evaluating large language models
(LLMs). Yet, the underlying concepts that drive these preferences remain poorly
understood. In this work, we propose a fully automated method for generating
local and global concept-based explanations of preferences across multiple
domains. Our method utilizes an LLM to identify concepts that distinguish
between chosen and rejected responses, and to represent them with concept-based
vectors. To model the relationships between concepts and preferences, we
propose a white-box Hierarchical Multi-Domain Regression model that captures
both domain-general and domain-specific effects. To evaluate our method, we
curate a dataset spanning eight challenging and diverse domains and explain
twelve mechanisms. Our method achieves strong preference prediction
performance, outperforming baselines while also being explainable.
Additionally, we assess explanations in two application-driven settings. First,
guiding LLM outputs with concepts from LaaJ explanations yields responses that
those judges consistently prefer. Second, prompting LaaJs with concepts
that explain human preferences improves their preference predictions. Together, our work
establishes a new paradigm for explainability in the era of LLMs.
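
To make the modeling idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of a white-box hierarchical multi-domain logistic regression over concept-based vectors: a shared weight vector captures domain-general concept effects, while per-domain deviation vectors, shrunk toward zero, capture domain-specific effects. All function and parameter names here are illustrative assumptions.

```python
# Hypothetical sketch of a hierarchical multi-domain preference model.
# Each example: a concept vector x (how strongly each concept favors the
# chosen over the rejected response), a domain id d, and a label y in {0, 1}.
# Predicted logit = x . w_g + x . W_d[d] + b, where w_g is domain-general
# and W_d[d] is a domain-specific deviation.
import numpy as np

def fit_hierarchical_preference_model(X, y, domains, n_domains,
                                      lr=0.1, epochs=500,
                                      lam_general=1e-3, lam_specific=1e-1):
    """X: (n, k) concept vectors; y: (n,) labels; domains: (n,) int domain ids."""
    n, k = X.shape
    w_g = np.zeros(k)                 # domain-general concept effects
    W_d = np.zeros((n_domains, k))    # domain-specific deviations
    b = 0.0
    for _ in range(epochs):
        logits = X @ w_g + np.einsum("ij,ij->i", X, W_d[domains]) + b
        p = 1.0 / (1.0 + np.exp(-logits))
        err = p - y                   # gradient of the log-loss w.r.t. the logits
        w_g -= lr * (X.T @ err / n + lam_general * w_g)
        # Stronger shrinkage on the deviations keeps concept effects
        # domain-general unless the data supports a domain-specific correction.
        for d in range(n_domains):
            mask = domains == d
            if mask.any():
                W_d[d] -= lr * (X[mask].T @ err[mask] / n + lam_specific * W_d[d])
        b -= lr * err.mean()
    return w_g, W_d, b
```

Under this framing, the fitted w_g coefficients act as a global explanation of which concepts drive preferences across domains, the rows of W_d expose domain-specific adjustments, and the per-example contributions x * (w_g + W_d[d]) serve as local explanations for individual preference decisions.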