CritiQ: Mining Data Quality Criteria from Human Preferences
February 26, 2025
Authors: Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, Tao Gui
cs.AI
Abstract
Language models depend heavily on high-quality data for optimal performance.
Existing approaches rely on manually designed heuristics, the perplexity of
existing models, training classifiers, or careful prompt engineering, which
require significant expert experience and human annotation effort while
introducing biases. We introduce CritiQ, a novel data selection method that
automatically mines criteria from human preferences for data quality with only
about 30 human-annotated pairs and performs efficient data selection. The main
component, CritiQ Flow, employs a manager agent to evolve quality criteria and
worker agents to make pairwise judgments. We build a knowledge base that
extracts quality criteria from previous work to boost CritiQ Flow. Compared to
perplexity- and classifier-based methods, verbal criteria are more
interpretable and possess reusable value. After deriving the criteria, we train
the CritiQ Scorer to give quality scores and perform efficient data selection.
We demonstrate the effectiveness of our method in the code, math, and logic
domains, achieving high accuracy on human-annotated test sets. To validate the
quality of the selected data, we continually train Llama 3.1 models and observe
improved performance on downstream tasks compared to uniform sampling. Ablation
studies validate the benefits of the knowledge base and the reflection process.
We analyze how criteria evolve and the effectiveness of majority voting.
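
To make the pairwise-judgment and majority-voting idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: the names `toy_judge`, `criterion_accuracy`, `predict_pair`, and `keep_accurate`, the three toy criteria, and the accuracy threshold are all hypothetical, and the heuristic judge only stands in for an LLM worker agent. Keeping criteria by accuracy is a simple stand-in for the manager agent's criteria evolution, which the abstract does not specify in detail.

```python
from collections import Counter
from typing import Callable, List, Tuple

Pair = Tuple[str, str]                   # (text_a, text_b)
Judge = Callable[[str, str, str], str]   # (criterion, text_a, text_b) -> "A" or "B"


def toy_judge(criterion: str, a: str, b: str) -> str:
    """Hypothetical stand-in for a worker agent: a fixed heuristic per criterion."""
    scorers = {
        "has_comments": lambda t: t.count("#"),
        "has_docstring": lambda t: t.count('"""'),
        "is_short": lambda t: -len(t),
    }
    score = scorers[criterion]
    return "A" if score(a) >= score(b) else "B"


def majority_vote(votes: List[str]) -> str:
    """Return the most common vote, or 'tie' when the top two counts are equal."""
    (top, n), *rest = Counter(votes).most_common()
    if rest and rest[0][1] == n:
        return "tie"
    return top


def criterion_accuracy(criterion: str, pairs: List[Pair],
                       labels: List[str], judge: Judge) -> float:
    """Fraction of human-annotated pairs on which one criterion agrees with the label."""
    correct = sum(judge(criterion, a, b) == y for (a, b), y in zip(pairs, labels))
    return correct / len(pairs)


def predict_pair(criteria: List[str], a: str, b: str, judge: Judge) -> str:
    """Judge a pair under every criterion, then combine the judgments by majority vote."""
    return majority_vote([judge(c, a, b) for c in criteria])


def keep_accurate(criteria: List[str], pairs: List[Pair], labels: List[str],
                  judge: Judge, threshold: float = 0.75) -> List[str]:
    """Assumed selection rule: keep criteria whose accuracy reaches the threshold."""
    return [c for c in criteria
            if criterion_accuracy(c, pairs, labels, judge) >= threshold]


# Two toy annotated pairs; the label names the text a human preferred.
PAIRS: List[Pair] = [
    ('# add two numbers\ndef add(x, y):\n    """Return x + y."""\n    return x + y',
     "def f(x,y): return x+y"),
    ("x=1", "# counter\nx = 1"),
]
LABELS = ["A", "B"]
CRITERIA = ["has_comments", "has_docstring", "is_short"]

if __name__ == "__main__":
    for c in CRITERIA:
        print(c, criterion_accuracy(c, PAIRS, LABELS, toy_judge))
    print(keep_accurate(CRITERIA, PAIRS, LABELS, toy_judge))
```

In this toy setup, each criterion is scored against the human preference labels, weak criteria are dropped, and the surviving criteria vote on new pairs. In a real pipeline the judge would be a model call, and a tie policy would need to be chosen explicitly.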