MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

Abstract

Large Language Models (LLMs) have made great strides in recent years, achieving unprecedented performance across a wide range of tasks. However, due to commercial interests, the most competitive models, such as GPT, Gemini, and Claude, are gated behind proprietary interfaces, and their training details are not disclosed. Recently, many institutions have open-sourced strong LLMs such as LLaMA-3 that are comparable to existing closed-source LLMs. However, only the model weights are released; most details (e.g., intermediate checkpoints, pre-training corpus, and training code) remain undisclosed. To improve the transparency of LLMs, the research community has begun to release truly open LLMs (e.g., Pythia, Amber, OLMo), which provide more details (e.g., pre-training corpus and training code). These models have greatly advanced the scientific study of large models, including their strengths, weaknesses, biases, and risks. However, we observe that existing truly open LLMs still lag behind state-of-the-art LLMs of similar size on reasoning, knowledge, and coding tasks. To this end, we open-source MAP-Neo, a highly capable and transparent bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens. MAP-Neo is the first fully open-sourced bilingual LLM whose performance is comparable to that of existing state-of-the-art LLMs. Moreover, we open-source all the details needed to reproduce MAP-Neo, including the cleaned pre-training corpus, the data cleaning pipeline, checkpoints, and a well-optimized training/evaluation framework. Finally, we hope that MAP-Neo will strengthen the open research community and inspire further innovation and creativity, facilitating further improvements to LLMs.
