

Rethinking Diverse Human Preference Learning through Principal Component Analysis

February 18, 2025
Authors: Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen
cs.AI

Abstract

Understanding human preferences is crucial for improving foundation models and building personalized AI systems. However, preferences are inherently diverse and complex, making it difficult for traditional reward models to capture their full range. While fine-grained preference data can help, collecting it is expensive and hard to scale. In this paper, we introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons without requiring fine-grained annotations. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA). By constructing a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference. These decomposed rewards can be flexibly combined to align with different user needs, offering an interpretable and scalable alternative to traditional reward models. We demonstrate that DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training. Our results highlight DRMs as a powerful framework for personalized and interpretable LLM alignment.
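The abstract describes the core mechanism only at a high level: embed preference pairs, take the difference between preferred and rejected embeddings, and run PCA to obtain orthogonal reward directions. The sketch below illustrates that idea in Python; it is not the authors' implementation. The `embed` inputs, the number of components, and the user-specific weight vector are all assumptions made for illustration.

```python
# Minimal sketch of the DRM idea from the abstract (not the authors' code).
# Assumes `chosen_embs` and `rejected_embs` are precomputed feature vectors for the
# preferred and rejected responses of each binary comparison (shape: [n_pairs, dim]).
import numpy as np
from sklearn.decomposition import PCA


def decompose_preferences(chosen_embs: np.ndarray,
                          rejected_embs: np.ndarray,
                          n_components: int = 10) -> np.ndarray:
    """Fit PCA on embedding differences between preferred and rejected responses.

    Each principal component is an orthogonal direction in embedding space,
    read here as one decomposed reward head.
    """
    diffs = chosen_embs - rejected_embs          # one row per binary comparison
    pca = PCA(n_components=n_components)
    pca.fit(diffs)
    return pca.components_                       # (n_components, dim) basis vectors


def decomposed_rewards(response_emb: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Score a single response embedding along each decomposed preference direction."""
    return basis @ response_emb                  # one scalar reward per component


def user_adapted_reward(response_emb: np.ndarray,
                        basis: np.ndarray,
                        weights: np.ndarray) -> float:
    """Combine decomposed rewards with user-specific weights (hypothetically fit
    on a small calibration set) to adapt to a new user without retraining."""
    return float(weights @ decomposed_rewards(response_emb, basis))
```

Because the basis vectors are fixed once fitted, adapting to a new user only requires choosing the combination weights, which is consistent with the abstract's claim that DRMs adapt to new users without additional training.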
