Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data

Abstract

Current trends in pre-training capable Large Language Models (LLMs) mostly focus on scaling model and dataset size. However, although the quality of pre-training data is an important factor for training powerful LLMs, it remains a nebulous concept that has not been fully characterized. Therefore, we use the recently proposed Task2Vec diversity coefficient to ground and understand formal aspects of data quality and to go beyond scale alone. Specifically, we measure the diversity coefficient of publicly available pre-training datasets to demonstrate that their formal diversity is high when compared to theoretical lower and upper bounds. In addition, to build confidence in the diversity coefficient, we conduct interpretability experiments and find that the coefficient aligns with intuitive properties of diversity, e.g., it increases as the number of latent concepts increases. We conclude that the diversity coefficient is reliable, show that it is high for publicly available LLM datasets, and conjecture that it can be used to build useful diverse datasets for LLMs.
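The abstract does not spell out how the coefficient is computed. As a rough illustration only, the sketch below assumes the diversity coefficient can be read as the average pairwise cosine distance between Task2Vec embeddings of batches sampled from a dataset. That definition is an assumption not stated in this text, the function name `diversity_coefficient` is hypothetical, and the embeddings are stood in by plain NumPy vectors rather than produced by a real Task2Vec probe network.

```python
import numpy as np


def diversity_coefficient(batch_embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance between batch embeddings.

    Assumption: each row is a (nonzero) embedding of one sampled batch.
    In the paper's setting these would be Task2Vec embeddings; here they
    are arbitrary vectors used purely for illustration.
    """
    x = np.asarray(batch_embeddings, dtype=float)
    if x.ndim != 2 or x.shape[0] < 2:
        raise ValueError("need a 2-D array with at least two embeddings")

    # Normalize each embedding to unit length so dot products are cosines.
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    cos_sim = unit @ unit.T

    # Average the cosine distance (1 - similarity) over distinct pairs only.
    upper = np.triu_indices(x.shape[0], k=1)
    return float(np.mean(1.0 - cos_sim[upper]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in embeddings: 8 batches, 16 dimensions each (illustrative sizes).
    fake_embeddings = rng.normal(size=(8, 16))
    print(diversity_coefficient(fake_embeddings))
```

Under this reading, a dataset whose batches all yield the same embedding direction scores 0, while batches with mutually orthogonal embeddings score 1, which is one way to picture the lower and upper reference points the abstract mentions.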

