
XGen-7B Technical Report

September 7, 2023
作者: Erik Nijkamp, Tian Xie, Hiroaki Hayashi, Bo Pang, Congying Xia, Chen Xing, Jesse Vig, Semih Yavuz, Philippe Laban, Ben Krause, Senthil Purushwalkam, Tong Niu, Wojciech Kryściński, Lidiya Murakhovs'ka, Prafulla Kumar Choubey, Alex Fabbri, Ye Liu, Rui Meng, Lifu Tu, Meghana Bhat, Chien-Sheng Wu, Silvio Savarese, Yingbo Zhou, Shafiq Joty, Caiming Xiong
cs.AI

Abstract

Large Language Models (LLMs) have become ubiquitous across various domains, transforming the way we interact with information and conduct research. However, most high-performing LLMs remain confined behind proprietary walls, hindering scientific progress. Most open-source LLMs, on the other hand, are limited in their ability to support longer sequence lengths, a key requirement for many tasks that require inference over an input context. To address this, we have trained XGen, a series of 7B-parameter models, on sequences of up to 8K tokens and on up to 1.5T tokens of data. We have also fine-tuned the XGen models on public-domain instructional data, creating their instruction-tuned counterparts (XGen-Inst). We open-source our models for both research advancement and commercial applications. Our evaluation on standard benchmarks shows that the XGen models achieve results comparable to or better than state-of-the-art open-source LLMs. Our targeted evaluation on long-sequence modeling tasks shows the benefits of our 8K-sequence models over 2K-sequence open-source LLMs.
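
Since the models are released openly, they should be loadable with standard tooling. Below is a minimal sketch using Hugging Face transformers; the repository ID (Salesforce/xgen-7b-8k-base) and the need for trust_remote_code follow the public XGen release conventions but are assumptions, not details stated in this abstract.

```python
# Minimal sketch: load an XGen checkpoint and generate text with
# Hugging Face transformers. The repo ID and the trust_remote_code flag
# are assumptions based on the public release, not stated in the abstract.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Salesforce/xgen-7b-8k-base"  # assumed Hub repo ID

# The released XGen tokenizer is custom (tiktoken-based), hence
# trust_remote_code=True (assumed).
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

prompt = "Large Language Models are"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The 8K-sequence variant is the one highlighted by the report's long-sequence evaluations; shorter-context base and instruction-tuned (XGen-Inst) variants would be loaded the same way under different repo IDs.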
