
LLM360: Towards Fully Transparent Open-Source LLMs

December 11, 2023
Authors: Zhengzhong Liu, Aurick Qiao, Willie Neiswanger, Hongyi Wang, Bowen Tan, Tianhua Tao, Junbo Li, Yuqi Wang, Suqi Sun, Omkar Pangarkar, Richard Fan, Yi Gu, Victor Miller, Yonghao Zhuang, Guowei He, Haonan Li, Fajri Koto, Liping Tang, Nikhil Ranjan, Zhiqiang Shen, Xuguang Ren, Roberto Iriondo, Cun Mu, Zhiting Hu, Mark Schulze, Preslav Nakov, Tim Baldwin, Eric P. Xing
cs.AI

Abstract

The recent surge in open-source Large Language Models (LLMs), such as LLaMA, Falcon, and Mistral, provides diverse options for AI practitioners and researchers. However, most LLMs have only released partial artifacts, such as the final model weights or inference code, and technical reports increasingly limit their scope to high-level design choices and surface statistics. These choices hinder progress in the field by degrading transparency into the training of LLMs and forcing teams to rediscover many details in the training process. We present LLM360, an initiative to fully open-source LLMs, which advocates for all training code and data, model checkpoints, and intermediate results to be made available to the community. The goal of LLM360 is to support open and collaborative AI research by making the end-to-end LLM training process transparent and reproducible by everyone. As a first step of LLM360, we release two 7B parameter LLMs pre-trained from scratch, Amber and CrystalCoder, including their training code, data, intermediate checkpoints, and analyses (at https://www.llm360.ai). We are committed to continually pushing the boundaries of LLMs through this open-source effort. More large-scale and stronger models are underway and will be released in the future.
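As a quick illustration, the released checkpoints can typically be loaded with the Hugging Face transformers library. The sketch below is an assumption-laden example, not part of the paper: it assumes the models are published under an "LLM360" organization (e.g., "LLM360/Amber") and that intermediate checkpoints are exposed as repository revisions; consult https://www.llm360.ai for the actual repository names and checkpoint tags.

```python
# Minimal sketch: loading a released LLM360 checkpoint with Hugging Face transformers.
# Assumptions (not stated in the abstract): the model is hosted at "LLM360/Amber" and
# intermediate checkpoints are published as git revisions of that repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LLM360/Amber"  # hypothetical repository name
revision = "main"          # swap in an intermediate-checkpoint revision if one is published

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision)

prompt = "Open-source language models are"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because LLM360 releases intermediate checkpoints alongside the final weights, the same loading pattern can in principle be repeated across revisions to study how model behavior evolves over the course of pre-training.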