MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning
May 30, 2025
作者: Jingyan Shen, Jiarui Yao, Rui Yang, Yifan Sun, Feng Luo, Rui Pan, Tong Zhang, Han Zhao
cs.AI
Abstract
Reward modeling is a key step in building safe foundation models when applying reinforcement learning from human feedback (RLHF) to align Large Language Models (LLMs). However, reward modeling based on the Bradley-Terry (BT) model assumes a single global reward function and therefore fails to capture the inherently diverse and heterogeneous nature of human preferences. This oversimplification limits LLMs from supporting personalization and pluralistic alignment. Theoretically, we show that when human preferences follow a mixture distribution over diverse subgroups, a single BT model incurs an irreducible error. While existing solutions, such as multi-objective learning with fine-grained annotations, help address this issue, they are costly, constrained by predefined attributes, and fail to fully capture the richness of human values. In this work, we introduce MiCRo, a two-stage framework that enhances personalized preference learning by leveraging large-scale binary preference datasets without requiring explicit fine-grained annotations. In the first stage, MiCRo introduces a context-aware mixture modeling approach to capture diverse human preferences. In the second stage, MiCRo integrates an online routing strategy that dynamically adapts the mixture weights based on the specific context to resolve ambiguity, allowing for efficient and scalable preference adaptation with minimal additional supervision. Experiments on multiple preference datasets demonstrate that MiCRo effectively captures diverse human preferences and significantly improves downstream personalization.
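
For concreteness, the contrast the abstract draws can be written out as follows. This is a minimal sketch, not the paper's own formulation: the symbols r_k, w_k, K, and sigma are illustrative, and the exact parameterization of the mixture and routing weights may differ in the paper. The standard BT model scores a preferred response y_w over a rejected response y_l with one global reward function, whereas a mixture with context-aware routing weights combines several subgroup-specific reward heads.

% Standard Bradley-Terry model with a single global reward r(x, y):
P(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)

% Illustrative mixture over K subgroup-specific reward heads r_k with
% context-dependent routing weights w_k(x), where \sum_{k=1}^{K} w_k(x) = 1:
P(y_w \succ y_l \mid x) = \sum_{k=1}^{K} w_k(x)\,\sigma\bigl(r_k(x, y_w) - r_k(x, y_l)\bigr)

Under this reading, the first stage would fit the mixture components from binary preference data, and the second stage would adjust w_k(x) online for a given context with little extra supervision.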