MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning

Abstract

Reward modeling is a key step in building safe foundation models when applying reinforcement learning from human feedback (RLHF) to align large language models (LLMs). However, reward modeling based on the Bradley-Terry (BT) model assumes a global reward function and therefore fails to capture inherently diverse and heterogeneous human preferences. This oversimplification limits the ability of LLMs to support personalization and pluralistic alignment. Theoretically, we show that when human preferences follow a mixture distribution of diverse subgroups, a single BT model has an irreducible error. Existing solutions, such as multi-objective learning with fine-grained annotations, help address this issue, but they are costly and constrained by predefined attributes, so they fail to fully capture the richness of human values. In this work, we introduce MiCRo, a two-stage framework that enhances personalized preference learning by leveraging large-scale binary preference datasets without requiring explicit fine-grained annotations. In the first stage, MiCRo introduces a context-aware mixture modeling approach to capture diverse human preferences. In the second stage, MiCRo integrates an online routing strategy that dynamically adapts the mixture weights based on the specific context to resolve ambiguity, allowing for efficient and scalable preference adaptation with minimal additional supervision. Experiments on multiple preference datasets demonstrate that MiCRo effectively captures diverse human preferences and significantly improves downstream personalization.
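As a rough illustration of the mixture idea described in the abstract (a minimal sketch, not the authors' implementation), the snippet below contrasts a single Bradley-Terry preference probability with a weighted mixture of per-subgroup BT probabilities. The function names, the two-subgroup setup, the reward values, and the weights are all hypothetical and chosen only for illustration. In the framework as described, the weights would come from a context-aware router, whereas here they are passed in by hand.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function used by the Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(-z))


def bt_prob(reward_a: float, reward_b: float) -> float:
    """Single BT model: probability that response A is preferred over B."""
    return sigmoid(reward_a - reward_b)


def mixture_bt_prob(weights, rewards_a, rewards_b) -> float:
    """Mixture of BT models: weighted sum of per-subgroup BT probabilities.

    weights[k] is the (assumed) mixture weight of subgroup k for the
    current context; rewards_a[k] and rewards_b[k] are that subgroup's
    rewards for responses A and B.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(
        w * bt_prob(ra, rb)
        for w, ra, rb in zip(weights, rewards_a, rewards_b)
    )


# Hypothetical example: two subgroups with opposite preferences.
rewards_a = [2.0, -2.0]  # subgroup 0 likes A, subgroup 1 dislikes A
rewards_b = [0.0, 0.0]

# With even weights the population-level preference looks like a coin flip,
# which hides the fact that each subgroup has a strong preference.
p_even = mixture_bt_prob([0.5, 0.5], rewards_a, rewards_b)

# If the context indicates subgroup 0, shifting the weights recovers
# that subgroup's preference.
p_routed = mixture_bt_prob([0.9, 0.1], rewards_a, rewards_b)

print(f"even weights:   {p_even:.3f}")
print(f"routed weights: {p_routed:.3f}")
```

In this toy setting a single global reward can only describe the averaged behavior, while reweighting the mixture per context lets the same components express either subgroup's preference.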

