Multi-Domain Explainability of Preferences

Abstract

Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and reward models, are central to aligning and evaluating large language models (LLMs). Yet the underlying concepts that drive these preferences remain poorly understood. In this work, we propose a fully automated method for generating local and global concept-based explanations of preferences across multiple domains. Our method uses an LLM to identify concepts that distinguish between chosen and rejected responses, and to represent them with concept-based vectors. To model the relationships between concepts and preferences, we propose a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects. To evaluate our method, we curate a dataset spanning eight challenging and diverse domains and explain twelve mechanisms. Our method achieves strong preference prediction performance, outperforming baselines while also being explainable. Additionally, we assess explanations in two application-driven settings. First, guiding LLM outputs with concepts from LaaJ explanations yields responses that those judges consistently prefer. Second, prompting LaaJs with concepts that explain human preferences improves their preference predictions. Together, our work establishes a new paradigm for explainability in the era of LLMs.
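To make the idea of domain-general and domain-specific effects concrete, the following is a minimal sketch, not the authors' implementation. It assumes each preference pair is encoded as a concept vector `x` (how much more each concept favors one response over the other), and that the probability of preferring that response is a logistic function of `x · (w + u[d])`, where `w` is shared across domains and `u[d]` is a deviation for domain `d`. The function names, the penalty strengths, the plain gradient-descent fit, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_hierarchical(X, y, domains, n_domains,
                     lam_shared=0.001, lam_domain=0.01, lr=0.5, steps=2000):
    """Fit shared weights w and per-domain deviations u by gradient descent.

    Hypothetical model: logit = x . (w + u[domain]). The deviations are
    penalized more strongly than the shared weights, so an effect is pushed
    into w unless a domain clearly disagrees.
    """
    n, k = X.shape
    w = np.zeros(k)
    u = np.zeros((n_domains, k))
    for _ in range(steps):
        p = sigmoid(np.sum(X * (w + u[domains]), axis=1))
        err = p - y  # derivative of the log loss with respect to the logit
        grad_w = X.T @ err / n + lam_shared * w
        grad_u = np.zeros_like(u)
        # Accumulate each example's gradient into its own domain's row.
        np.add.at(grad_u, domains, X * err[:, None])
        grad_u = grad_u / n + lam_domain * u
        w -= lr * grad_w
        u -= lr * grad_u
    return w, u

def predict_proba(X, domains, w, u):
    return sigmoid(np.sum(X * (w + u[domains]), axis=1))

# Synthetic data (an assumption for this sketch): 2 domains, 3 concepts.
# Concept 0 matters everywhere, concept 1 flips sign between domains,
# concept 2 is irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 3))
domains = np.repeat([0, 1], 400)
true_w = np.array([2.0, 0.0, 0.0])
true_u = np.array([[0.0, 1.5, 0.0], [0.0, -1.5, 0.0]])
true_logits = np.sum(X * (true_w + true_u[domains]), axis=1)
y = (rng.random(800) < sigmoid(true_logits)).astype(float)

w, u = fit_hierarchical(X, y, domains, n_domains=2)
acc = np.mean((predict_proba(X, domains, w, u) > 0.5) == (y == 1.0))

# Global explanation: effective concept weights per domain.
for d in range(2):
    print(f"domain {d} effective weights:", np.round(w + u[d], 2))
# Local explanation: per-concept contribution to one example's logit.
print("contributions for example 0:", np.round(X[0] * (w + u[domains[0]]), 2))
print("training accuracy:", round(float(acc), 3))
```

In this sketch, the effective weights `w + u[d]` play the role of a global explanation for domain `d`, and the elementwise product of one example's concept vector with those weights plays the role of a local explanation. Because the model is linear in the concepts, both can be read directly from the fitted parameters.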
