The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Abstract

Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable curation is and whether we will soon run out of unique high-quality data. Contrary to previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models, even significantly outperforming state-of-the-art models trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release an extract of 600 billion tokens from our RefinedWeb dataset, along with 1.3B and 7.5B parameter language models trained on it.
