Skywork: A More Open Bilingual Foundation Model

Abstract

In this technical report, we present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts. This bilingual foundation model is the most extensively trained openly published LLM of comparable size to date. We introduce a two-stage training methodology using a segmented corpus, with the first stage targeting general-purpose training and the second targeting domain-specific enhancement training. We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling across diverse domains. Furthermore, we propose a novel leakage detection method, demonstrating that test data contamination is a pressing issue that warrants further investigation by the LLM community. To spur future research, we release Skywork-13B along with checkpoints obtained at intermediate stages of the training process. We are also releasing part of our SkyPile corpus, a collection of over 150 billion tokens of web text, which is the largest high-quality open Chinese pre-training corpus to date. We hope Skywork-13B and our open corpus will serve as a valuable open-source resource to democratize access to high-quality LLMs.
