

Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment

September 28, 2025
Authors: Min-Hsuan Yeh, Yixuan Li
cs.AI

Abstract

Human feedback plays a pivotal role in aligning large language models (LLMs) with human preferences. However, such feedback is often noisy or inconsistent, which can degrade the quality of reward models and hinder alignment. While various automated data cleaning methods have been proposed to mitigate this issue, a systematic evaluation of their effectiveness and generalizability remains lacking. To bridge this gap, we introduce the first comprehensive benchmark for evaluating 13 preference data cleaning methods in the context of LLM alignment. PrefCleanBench offers a standardized protocol to assess cleaning strategies in terms of alignment performance and generalizability across diverse datasets, model architectures, and optimization algorithms. By unifying disparate methods and rigorously comparing them, we uncover key factors that determine the success of data cleaning in alignment tasks. This benchmark lays the groundwork for principled and reproducible approaches to improving LLM alignment through better data quality, highlighting the crucial but underexplored role of data preprocessing in responsible AI development. We release modular implementations of all methods to catalyze further research: https://github.com/deeplearning-wisc/PrefCleanBench.
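To make the task concrete, here is a minimal, hypothetical sketch of one family of cleaning strategies such a benchmark might evaluate: filtering out preference pairs where a proxy reward model barely distinguishes the "chosen" from the "rejected" response. The `score` function, the margin threshold, and the data layout below are illustrative assumptions, not PrefCleanBench's actual API.

```python
# Hypothetical margin-based cleaning of preference pairs.
# A pair (prompt, chosen, rejected) is kept only if a proxy scorer
# prefers `chosen` over `rejected` by at least `margin` -- pairs with
# small margins are treated as likely label noise and dropped.
# NOTE: `score` and `margin` are illustrative, not the paper's method.

def clean_preference_pairs(pairs, score, margin=0.5):
    """Keep only pairs with a clear chosen-over-rejected score margin."""
    cleaned = []
    for prompt, chosen, rejected in pairs:
        if score(prompt, chosen) - score(prompt, rejected) >= margin:
            cleaned.append((prompt, chosen, rejected))
    return cleaned

# Toy usage with a stand-in scorer (response length as "reward"):
pairs = [
    ("q1", "a detailed answer", "ok"),  # large margin -> kept
    ("q2", "short", "shrt"),            # tiny margin  -> dropped
]
toy_score = lambda prompt, resp: float(len(resp))
print(clean_preference_pairs(pairs, toy_score, margin=5.0))
```

In a real pipeline the scorer would be a trained reward model, and the cleaned set would then feed a preference-optimization algorithm such as DPO or RLHF; the benchmark's question is how much such filtering choices matter downstream.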
September 30, 2025