CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training

Abstract

Selecting high-quality data for pre-training is crucial in shaping the downstream task performance of language models. A major challenge lies in identifying this optimal subset, a problem generally considered intractable, which necessitates scalable and effective heuristics. In this work, we propose a data selection method, CoLoR-Filter (Conditional Loss Reduction Filtering), which leverages an empirical Bayes-inspired approach to derive a simple and computationally efficient selection criterion based on the relative loss values of two auxiliary models. In addition to the modeling rationale, we evaluate CoLoR-Filter empirically on two language modeling tasks: (1) selecting data from C4 for domain adaptation, evaluated on Books, and (2) selecting data from C4 for a suite of downstream multiple-choice question answering tasks. We demonstrate favorable scaling both as we subselect more aggressively and as we use small auxiliary models to select data for large target models. As one headline result, CoLoR-Filter data selected using a pair of 150M-parameter auxiliary models can train a 1.2B-parameter target model to match a 1.2B-parameter model trained on 25B randomly selected tokens, using 25x less data for Books and 11x less data for the downstream tasks.

Code: https://github.com/davidbrandfonbrener/color-filter-olmo

Filtered data: https://huggingface.co/datasets/davidbrandfonbrener/color-filtered-c4
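
The abstract says only that the selection criterion is based on the relative loss values of two auxiliary models. As a minimal sketch of what such a criterion could look like, the snippet below assumes that each candidate sequence is scored by its loss under a "prior" auxiliary model minus its loss under a "conditional" auxiliary model (one further adapted to the target data), and that the highest-scoring fraction is kept. The function name, argument names, scoring direction, and toy loss values are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def select_by_loss_reduction(
    prior_losses: Sequence[float],
    conditional_losses: Sequence[float],
    keep_fraction: float,
) -> List[int]:
    """Return indices of the candidates with the largest loss reduction.

    Assumption: the score is (prior loss - conditional loss), so a large
    score means the conditional auxiliary model finds the sequence much
    easier than the prior auxiliary model does.
    """
    if len(prior_losses) != len(conditional_losses):
        raise ValueError("both loss lists must cover the same candidates")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    n = len(prior_losses)
    k = max(1, int(n * keep_fraction))  # number of candidates to keep

    # Per-candidate loss reduction between the two auxiliary models.
    reductions = [p - c for p, c in zip(prior_losses, conditional_losses)]

    # Rank candidates from largest to smallest reduction and keep the top k.
    ranked = sorted(range(n), key=lambda i: reductions[i], reverse=True)
    return sorted(ranked[:k])


# Toy per-sequence losses (hypothetical numbers, for illustration only).
prior = [3.0, 2.5, 4.0, 3.2]
conditional = [2.0, 2.6, 3.9, 2.0]
selected = select_by_loss_reduction(prior, conditional, keep_fraction=0.5)
print(selected)
```

In this toy run the two candidates whose loss drops the most between the auxiliary models are kept. In practice the two loss lists would come from scoring each candidate sequence with both auxiliary models, which needs only forward passes through small models.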