CritiQ: Mining Data Quality Criteria from Human Preferences

Abstract

Language models depend heavily on high-quality data for optimal performance. Existing approaches rely on manually designed heuristics, the perplexity of existing models, trained classifiers, or careful prompt engineering, all of which require significant expert experience and human annotation effort while introducing biases. We introduce CritiQ, a novel data selection method that automatically mines criteria for data quality from human preferences using only ~30 human-annotated pairs and performs efficient data selection. The main component, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments. We build a knowledge base of quality criteria extracted from previous work to boost CritiQ Flow. Compared to perplexity- and classifier-based methods, verbal criteria are more interpretable and can be reused. After deriving the criteria, we train the CritiQ Scorer to assign quality scores and perform efficient data selection. We demonstrate the effectiveness of our method in the code, math, and logic domains, achieving high accuracy on human-annotated test sets. To validate the quality of the selected data, we continually train Llama 3.1 models and observe improved performance on downstream tasks compared to uniform sampling. Ablation studies validate the benefits of the knowledge base and the reflection process. We also analyze how the criteria evolve and how effective majority voting is.
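The abstract mentions worker agents making pairwise judgments, majority voting, and accuracy on human-annotated pairs. The following is a minimal sketch of how those pieces could fit together, not the authors' implementation. In the paper the workers are agents applying verbal criteria; here they are replaced by toy heuristic functions. The names `majority_vote`, `accuracy`, the three `prefers_*` judges, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# A judge compares two texts and returns "A" or "B" for the better one.
# Hypothetical stand-in for a worker agent applying a single criterion.
Judge = Callable[[str, str], str]


def majority_vote(judges: Sequence[Judge], text_a: str, text_b: str) -> str:
    """Combine per-criterion pairwise judgments by simple majority."""
    votes = Counter(judge(text_a, text_b) for judge in judges)
    # Assumption: ties resolve to "A"; the source does not specify tie handling.
    return "A" if votes["A"] >= votes["B"] else "B"


def accuracy(judges: Sequence[Judge], pairs: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of human-annotated pairs where the vote matches the label."""
    correct = sum(majority_vote(judges, a, b) == label for a, b, label in pairs)
    return correct / len(pairs)


# Toy criteria (illustrative only, not criteria from the paper).
def prefers_longer(a: str, b: str) -> str:
    return "A" if len(a) >= len(b) else "B"


def prefers_commented(a: str, b: str) -> str:
    return "A" if a.count("#") >= b.count("#") else "B"


def prefers_fewer_todos(a: str, b: str) -> str:
    return "A" if a.count("TODO") <= b.count("TODO") else "B"


JUDGES = [prefers_longer, prefers_commented, prefers_fewer_todos]

# Each item is (text_a, text_b, human_label); the data is made up.
ANNOTATED_PAIRS = [
    ("x = 1  # set x", "x=1", "A"),
    ("TODO", "def f():\n    return 1  # ok", "B"),
]

if __name__ == "__main__":
    print(accuracy(JUDGES, ANNOTATED_PAIRS))
```

In a setup like this, a criterion whose individual judgments frequently disagree with the human labels would be a natural candidate for revision, which is roughly the role the abstract assigns to the manager agent.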

