MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning
May 30, 2025
Authors: Jingyan Shen, Jiarui Yao, Rui Yang, Yifan Sun, Feng Luo, Rui Pan, Tong Zhang, Han Zhao
cs.AI
Abstract
Reward modeling is a key step in building safe foundation models when applying reinforcement learning from human feedback (RLHF) to align Large Language Models (LLMs). However, reward modeling based on the Bradley-Terry (BT) model assumes a global reward function, failing to capture the inherent diversity and heterogeneity of human preferences. Such oversimplification limits LLMs from supporting personalization and pluralistic alignment. Theoretically, we show that when human preferences follow a mixture distribution over diverse subgroups, a single BT model has an irreducible error. While existing solutions, such as multi-objective learning with fine-grained annotations, help address this issue, they are costly and constrained by predefined attributes, failing to fully capture the richness of human values. In this work, we introduce MiCRo, a two-stage framework that enhances personalized preference learning by leveraging large-scale binary preference datasets without requiring explicit fine-grained annotations. In the first stage, MiCRo introduces a context-aware mixture modeling approach to capture diverse human preferences. In the second stage, MiCRo integrates an online routing strategy that dynamically adapts the mixture weights based on the specific context to resolve ambiguity, allowing for efficient and scalable preference adaptation with minimal additional supervision. Experiments on multiple preference datasets demonstrate that MiCRo effectively captures diverse human preferences and significantly improves downstream personalization.
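As a reading aid, one way to write down the kind of mixture the abstract refers to is sketched below; the notation (K, r_k, w_k) is introduced here for illustration and is not taken from the paper. A single BT model assumes

    P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big),

whereas if the population consists of K preference subgroups with subgroup-specific rewards r_k and context-dependent proportions w_k(x), observed preferences follow

    P(y_w \succ y_l \mid x) = \sum_{k=1}^{K} w_k(x)\, \sigma\big(r_k(x, y_w) - r_k(x, y_l)\big).

In general no single reward function r can reproduce such a mixture, which is the sense in which the abstract states that a single BT model incurs an irreducible error.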
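To make the two-stage description concrete, the following is a minimal, hypothetical PyTorch sketch of a mixture of Bradley-Terry reward heads combined with a context-aware router, in the spirit of what the abstract describes; the class name, featurization, and number of heads are illustrative assumptions, not the paper's released implementation.

    # Hypothetical sketch of a context-aware mixture of Bradley-Terry reward
    # heads; names, dimensions, and featurization are illustrative assumptions.
    import torch
    import torch.nn as nn

    class MixtureBTRewardModel(nn.Module):
        def __init__(self, ctx_dim: int, resp_dim: int, num_heads: int = 4):
            super().__init__()
            # Stage 1: K subgroup-specific reward heads over (context, response) features.
            self.reward_heads = nn.ModuleList(
                [nn.Linear(ctx_dim + resp_dim, 1) for _ in range(num_heads)]
            )
            # Stage 2: router mapping the context embedding to mixture weights.
            self.router = nn.Linear(ctx_dim, num_heads)

        def head_rewards(self, ctx: torch.Tensor, resp: torch.Tensor) -> torch.Tensor:
            feats = torch.cat([ctx, resp], dim=-1)
            # One scalar reward per head: shape (batch, num_heads).
            return torch.cat([head(feats) for head in self.reward_heads], dim=-1)

        def forward(self, ctx, resp_chosen, resp_rejected):
            # Per-head BT probability that the chosen response wins.
            margins = self.head_rewards(ctx, resp_chosen) - self.head_rewards(ctx, resp_rejected)
            p_heads = torch.sigmoid(margins)                   # (batch, num_heads)
            # Context-dependent mixture weights from the router.
            weights = torch.softmax(self.router(ctx), dim=-1)  # (batch, num_heads)
            # Negative log-likelihood of the observed preference under the mixture.
            p_mix = (weights * p_heads).sum(dim=-1).clamp_min(1e-8)
            return -torch.log(p_mix).mean()

Under this sketch, the online routing stage mentioned in the abstract could be realized by freezing the learned reward heads and updating only self.router on a small amount of context-specific feedback; this is again an assumption about how the adaptation might be implemented, following the abstract's description rather than the paper's code.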