Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment
September 28, 2025
Authors: Min-Hsuan Yeh, Yixuan Li
cs.AI
Abstract
Human feedback plays a pivotal role in aligning large language models (LLMs) with human preferences. However, such feedback is often noisy or inconsistent, which can degrade the quality of reward models and hinder alignment. While various automated data cleaning methods have been proposed to mitigate this issue, a systematic evaluation of their effectiveness and generalizability remains lacking. To bridge this gap, we introduce the first comprehensive benchmark for evaluating 13 preference data cleaning methods in the context of LLM alignment. PrefCleanBench offers a standardized protocol to assess cleaning strategies in terms of alignment performance and generalizability across diverse datasets, model architectures, and optimization algorithms. By unifying disparate methods and rigorously comparing them, we uncover key factors that determine the success of data cleaning in alignment tasks. This benchmark lays the groundwork for principled and reproducible approaches to improving LLM alignment through better data quality, highlighting the crucial but underexplored role of data preprocessing in responsible AI development. We release modular implementations of all methods to catalyze further research: https://github.com/deeplearning-wisc/PrefCleanBench.
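
The abstract does not describe the benchmark's interface, so purely as an illustration of the kind of cleaning step such a protocol would standardize, here is a minimal sketch of one common family of methods: filtering preference pairs by the reward margin a proxy scorer assigns between the chosen and rejected responses. All names here (PreferencePair, margin_filter, toy_score) are hypothetical and are not taken from the PrefCleanBench repository.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    """One human-annotated preference example: a prompt plus the
    response the annotator preferred and the one they rejected."""
    prompt: str
    chosen: str
    rejected: str

def margin_filter(
    pairs: List[PreferencePair],
    score: Callable[[str, str], float],
    min_margin: float = 0.0,
) -> List[PreferencePair]:
    """Keep only pairs where the proxy score of the chosen response
    exceeds that of the rejected one by at least `min_margin`; pairs
    with a small or negative margin are treated as likely label noise
    and dropped before reward-model training or DPO."""
    kept = []
    for pair in pairs:
        margin = score(pair.prompt, pair.chosen) - score(pair.prompt, pair.rejected)
        if margin >= min_margin:
            kept.append(pair)
    return kept

if __name__ == "__main__":
    # Toy stand-in scorer: response length in place of a trained reward model.
    toy_score = lambda prompt, response: float(len(response))
    data = [
        PreferencePair("Q1", "a detailed, helpful answer", "ok"),
        PreferencePair("Q2", "no", "a much longer but dispreferred reply"),
    ]
    cleaned = margin_filter(data, toy_score, min_margin=1.0)
    print(f"kept {len(cleaned)} of {len(data)} pairs")
```

In practice the score function would come from a trained reward model and the threshold would be tuned per dataset; a standardized protocol like the one the paper describes would fix exactly these choices so that different cleaning methods can be compared on equal footing.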