WorldPM: Scaling Human Preference Modeling
May 15, 2025
Authors: Binghai Wang, Runji Lin, Keming Lu, Le Yu, Zhenru Zhang, Fei Huang, Chujie Zheng, Kai Dang, Yang Fan, Xingzhang Ren, An Yang, Binyuan Hui, Dayiheng Liu, Tao Gui, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Bowen Yu, Jingren Zhou, Junyang Lin
cs.AI
Abstract
Motivated by scaling laws in language modeling that demonstrate how test loss scales as a power law with model and dataset sizes, we find that similar laws exist in preference modeling. We propose World Preference Modeling (WorldPM) to emphasize this scaling potential, where World Preference embodies a unified representation of human preferences. In this paper, we collect preference data from public forums covering diverse user communities, and conduct extensive training using 15M-scale data across models ranging from 1.5B to 72B parameters. We observe distinct patterns across different evaluation metrics: (1) Adversarial metrics (ability to identify deceptive features) consistently scale up with increased training data and base model size; (2) Objective metrics (objective knowledge with well-defined answers) show emergent behavior in larger language models, highlighting WorldPM's scalability potential; (3) Subjective metrics (subjective preferences from a limited number of humans or AI) do not demonstrate scaling trends. Further experiments validate the effectiveness of WorldPM as a foundation for preference fine-tuning. Through evaluations on 7 benchmarks with 20 subtasks, we find that WorldPM broadly improves the generalization performance across human preference datasets of varying sizes (7K, 100K and 800K samples), with performance gains exceeding 5% on many key subtasks. Integrating WorldPM into our internal RLHF pipeline, we observe significant improvements on both in-house and public evaluation sets, with notable gains of 4% to 8% in our in-house evaluations.
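
For reference, the "scaling law" framing invoked in the first sentence typically means fitting test loss as a power law in model parameters N and dataset size D. One commonly used parametric form is shown below; the symbols A, B, E, alpha, and beta are fitted per setting and are illustrative here, not values reported in this paper:

$$
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
$$

WorldPM's observation is that preference-modeling test loss exhibits an analogous power-law decrease as both the training data (up to 15M pairs) and the base model (1.5B to 72B) grow, at least for the adversarial and objective metric categories.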
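The abstract does not spell out the training objective, but preference models of this kind are typically trained on pairwise comparisons with a Bradley-Terry style loss. The following is a minimal, self-contained PyTorch sketch of that generic setup, assuming such a pairwise objective; TinyPreferenceModel, the random data, and all hyperparameters are placeholders for illustration, not the paper's 1.5B-72B models or its 15M-sample forum data.

```python
# Minimal sketch of pairwise preference modeling with a Bradley-Terry loss.
# NOTE: illustration of the generic technique, not the authors' released code;
# the model, data, and hyperparameters below are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPreferenceModel(nn.Module):
    """Toy scorer standing in for an LLM with a scalar reward head."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.reward_head = nn.Linear(dim, 1)

    def forward(self, token_ids):  # (batch, seq_len) -> (batch,)
        hidden, _ = self.encoder(self.embed(token_ids))
        # Score each sequence from its final hidden state.
        return self.reward_head(hidden[:, -1]).squeeze(-1)

def bt_loss(score_chosen, score_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

model = TinyPreferenceModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Fake preference pairs; in WorldPM these would come from forum votes.
chosen = torch.randint(0, 1000, (8, 32))
rejected = torch.randint(0, 1000, (8, 32))

loss = bt_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()
print(f"pairwise loss: {loss.item():.4f}")
```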