WorldPM: Scaling Human Preference Modeling
May 15, 2025
Authors: Binghai Wang, Runji Lin, Keming Lu, Le Yu, Zhenru Zhang, Fei Huang, Chujie Zheng, Kai Dang, Yang Fan, Xingzhang Ren, An Yang, Binyuan Hui, Dayiheng Liu, Tao Gui, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Bowen Yu, Jingren Zhou, Junyang Lin
cs.AI
Abstract
Motivated by scaling laws in language modeling that demonstrate how test loss
scales as a power law with model and dataset sizes, we find that similar laws
exist in preference modeling. We propose World Preference Modeling (WorldPM)
to emphasize this scaling potential, where World Preference embodies a unified
representation of human preferences. In this paper, we collect preference data
from public forums covering diverse user communities, and conduct extensive
training using 15M-scale data across models ranging from 1.5B to 72B
parameters. We observe distinct patterns across different evaluation metrics:
(1) Adversarial metrics (ability to identify deceptive features) consistently
scale up with increased training data and base model size; (2) Objective
metrics (objective knowledge with well-defined answers) show emergent behavior
in larger language models, highlighting WorldPM's scalability potential; (3)
Subjective metrics (subjective preferences from a limited number of humans or
AI) do not demonstrate scaling trends. Further experiments validate the
effectiveness of WorldPM as a foundation for preference fine-tuning. Through
evaluations on 7 benchmarks with 20 subtasks, we find that WorldPM broadly
improves the generalization performance across human preference datasets of
varying sizes (7K, 100K and 800K samples), with performance gains exceeding 5%
on many key subtasks. Integrating WorldPM into our internal RLHF pipeline, we
observe significant improvements on both in-house and public evaluation sets,
with notable gains of 4% to 8% in our in-house evaluations.
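The abstract invokes language-model scaling laws (test loss as a power law in model and dataset size) but does not state a functional form. For reference, a common parameterization from the scaling-law literature is sketched below; the constants E, A, B and exponents alpha, beta would have to be fit to measurements and are not values reported by this paper.

```latex
% Illustrative power-law form, not the paper's fitted equation.
% N = number of model parameters, D = number of training preference pairs,
% E = irreducible loss; A, B, \alpha, \beta = constants fit to observed losses.
\mathcal{L}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```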
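The abstract describes training preference models on 15M-scale pairwise data but does not spell out the training objective. The snippet below is a minimal sketch of the standard pairwise (Bradley-Terry style) loss commonly used for preference and reward models, offered only to make the setup concrete; the function and tensor names here are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of a pairwise (Bradley-Terry style) preference-modeling loss.
# The paper's exact recipe is not given in the abstract; all names below
# (preference_loss, chosen_scores, rejected_scores) are illustrative.
import torch
import torch.nn.functional as F


def preference_loss(chosen_scores: torch.Tensor,
                    rejected_scores: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the chosen response outranks the rejected one.

    Both inputs have shape (batch,) and hold scalar scores produced by a
    preference model for the preferred and dispreferred responses, respectively.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()


if __name__ == "__main__":
    # Placeholder scores standing in for model outputs on a batch of pairs.
    chosen_scores = torch.randn(8)
    rejected_scores = torch.randn(8)
    loss = preference_loss(chosen_scores, rejected_scores)
    print(f"pairwise preference loss: {loss.item():.4f}")
```

In practice the scalar scores would come from a scoring head on top of a pretrained language model (e.g. the 1.5B to 72B base models mentioned in the abstract), with one score per prompt-response pair.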