MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning

Abstract

Reward modeling is a key step in building safe foundation models when applying reinforcement learning from human feedback (RLHF) to align Large Language Models (LLMs). However, reward modeling based on the Bradley-Terry (BT) model assumes a global reward function and therefore fails to capture inherently diverse and heterogeneous human preferences. This oversimplification limits the ability of LLMs to support personalization and pluralistic alignment. Theoretically, we show that when human preferences follow a mixture distribution of diverse subgroups, a single BT model has an irreducible error. Existing solutions, such as multi-objective learning with fine-grained annotations, help address this issue, but they are costly and constrained by predefined attributes, so they fail to fully capture the richness of human values. In this work, we introduce MiCRo, a two-stage framework that enhances personalized preference learning by leveraging large-scale binary preference datasets without requiring explicit fine-grained annotations. In the first stage, MiCRo introduces a context-aware mixture modeling approach to capture diverse human preferences. In the second stage, MiCRo integrates an online routing strategy that dynamically adapts the mixture weights based on the specific context to resolve ambiguity, allowing for efficient and scalable preference adaptation with minimal additional supervision. Experiments on multiple preference datasets demonstrate that MiCRo effectively captures diverse human preferences and significantly improves downstream personalization.
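As a rough illustration of why a single global reward can fall short, the toy sketch below compares one Bradley-Terry probability with a mixture of BT heads. It is not the MiCRo implementation: the function names, the two-subgroup setup, the reward values, and the hand-picked routing weights are all assumptions made for this example.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bt_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that response a is preferred over response b."""
    return sigmoid(r_a - r_b)


def mixture_prob(weights, rewards_a, rewards_b) -> float:
    """Mixture of BT heads: sum_k w_k * sigmoid(r_k(a) - r_k(b))."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(w * bt_prob(ra, rb) for w, ra, rb in zip(weights, rewards_a, rewards_b))


# Hypothetical rewards: subgroup 0 strongly prefers a, subgroup 1 strongly prefers b.
rewards_a = [2.0, -2.0]
rewards_b = [0.0, 0.0]

# Pooled over both subgroups with equal weight, the preference looks like a coin flip,
# so a single BT model fit to the pooled data would learn r(a) = r(b).
pooled = mixture_prob([0.5, 0.5], rewards_a, rewards_b)

# If context suggests the user most likely belongs to subgroup 0, shifting the
# weights (chosen by hand here) recovers a clear preference for a.
routed = mixture_prob([0.9, 0.1], rewards_a, rewards_b)

print(f"pooled={pooled:.3f} routed={routed:.3f}")
```

In this toy setting, each subgroup on its own prefers one response with probability of about 0.88, yet the pooled probability is 0.5. That gap is the kind of error a single global reward cannot remove, and context-dependent mixture weights are one way to close it.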

