Multi-Domain Explainability of Preferences

Abstract

Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and reward models, are central to aligning and evaluating large language models (LLMs). Yet the underlying concepts that drive these preferences remain poorly understood. In this work, we propose a fully automated method for generating local and global concept-based explanations of preferences across multiple domains. Our method uses an LLM to identify concepts that distinguish chosen responses from rejected ones and to represent each response pair as a concept-based vector. To model the relationship between concepts and preferences, we propose a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects. To evaluate our method, we curate a dataset spanning eight challenging and diverse domains and explain twelve mechanisms. Our method achieves strong preference prediction performance, outperforming baselines while remaining explainable. Additionally, we assess the explanations in two application-driven settings. First, guiding LLM outputs with concepts from LaaJ explanations yields responses that those judges consistently prefer. Second, prompting LaaJs with concepts that explain human preferences improves their preference predictions. Together, our work establishes a new paradigm for explainability in the era of LLMs.
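
To make the idea of domain-general and domain-specific effects concrete, the following is a minimal sketch, not the model from the paper. It assumes each preference pair is encoded as a concept vector with entries in {-1, 0, 1} (a concept favors the second response, neither, or the first) and that the effective weight of a concept in a domain is a shared weight plus a penalized per-domain deviation. The names `fit_hierarchical` and `effective_weights`, the encoding, the logistic link, and the L2 penalty are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_hierarchical(X, y, domains, n_domains, lr=0.5, l2_domain=0.01, steps=2000):
    """Fit shared concept weights plus per-domain deviations (hypothetical sketch).

    X: (n, k) concept vectors, y: (n,) labels in {0, 1},
    domains: (n,) integer domain id of each example.
    Minimizes mean log-loss + (l2_domain / 2) * ||w_domain||^2 by gradient descent.
    """
    n, k = X.shape
    w_shared = np.zeros(k)
    w_domain = np.zeros((n_domains, k))
    for _ in range(steps):
        w_eff = w_shared + w_domain[domains]        # (n, k) weights per example
        p = sigmoid(np.sum(X * w_eff, axis=1))      # predicted P(y = 1)
        err = (p - y)[:, None] * X                  # per-example log-loss gradient
        grad_shared = err.mean(axis=0)
        grad_domain = np.zeros_like(w_domain)
        np.add.at(grad_domain, domains, err)        # sum gradients within each domain
        grad_domain = grad_domain / n + l2_domain * w_domain
        w_shared -= lr * grad_shared
        w_domain -= lr * grad_domain
    return w_shared, w_domain

def effective_weights(w_shared, w_domain):
    """Row d holds the white-box concept weights that apply in domain d."""
    return w_shared[None, :] + w_domain
```

Because the deviations are penalized and the shared weights are not, effects common to all domains tend to be absorbed by `w_shared`, while `w_domain` keeps only what differs between domains. Reading off the rows of `effective_weights` gives a per-domain view, and `w_shared` gives a cross-domain view, under the assumptions stated above.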
