On the Origin of LLMs: An Evolutionary Tree and Graph for 15,821 Large Language Models

July 19, 2023

Authors: Sarah Gao, Andrew Kean Gao

cs.AI

Abstract

Since late 2022, Large Language Models (LLMs) have risen to prominence, with LLMs such as ChatGPT and Bard attracting millions of users. Hundreds of new LLMs are announced each week, and many of them are deposited on Hugging Face, a repository of machine learning models and datasets. To date, nearly 16,000 Text Generation models have been uploaded to the site. Given this influx, it is of interest to know which LLM backbones, settings, training methods, and families are popular or trending. However, no comprehensive index of LLMs is available. We take advantage of the relatively systematic nomenclature of Hugging Face LLMs to perform hierarchical clustering and to identify communities among LLMs using n-grams and term frequency-inverse document frequency (TF-IDF). Our methods successfully identify families of LLMs and accurately cluster LLMs into meaningful subgroups. We present a public web application for navigating and exploring Constellation, our atlas of 15,821 LLMs. Constellation rapidly generates a variety of visualizations: dendrograms, graphs, word clouds, and scatter plots. Constellation is available at https://constellation.sites.stanford.edu/.
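The clustering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the n-gram type, the IDF variant, the similarity measure, or the linkage rule, so the character trigrams, smoothed IDF, cosine similarity, and average linkage used here are all assumptions. The function names and the model names in the demo are likewise hypothetical, chosen only to resemble Hub-style identifiers.

```python
import math
from collections import Counter


def char_ngrams(name, n=3):
    """Split a model name into overlapping character n-grams (assumed n=3)."""
    s = name.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def tfidf_vectors(names, n=3):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per name."""
    docs = [Counter(char_ngrams(name, n)) for name in names]
    # Document frequency: in how many names each n-gram appears.
    df = Counter(gram for doc in docs for gram in doc)
    total = len(docs)
    vectors = []
    for doc in docs:
        # Smoothed IDF (an assumption; other variants would also work).
        vec = {g: tf * (math.log((1 + total) / (1 + df[g])) + 1)
               for g, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def agglomerate(names, n_clusters, n=3):
    """Average-linkage agglomerative clustering on name similarity."""
    vecs = tfidf_vectors(names, n)
    clusters = [[i] for i in range(len(names))]

    def avg_sim(c1, c2):
        pairs = len(c1) * len(c2)
        return sum(cosine(vecs[i], vecs[j]) for i in c1 for j in c2) / pairs

    while len(clusters) > n_clusters:
        # Merge the two clusters with the highest average similarity.
        _, a, b = max(
            (avg_sim(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(names[i] for i in c) for c in clusters]


# Hypothetical names for illustration only.
demo_names = ["llama-7b", "llama-13b", "llama-7b-chat",
              "gpt2-small", "gpt2-medium", "gpt2-large"]
for family in agglomerate(demo_names, n_clusters=2):
    print(family)
```

Recording each merge and its similarity, rather than stopping at a fixed number of clusters, would yield the merge history that a dendrogram visualizes. The quadratic pair search above is only practical for small inputs; at the scale of thousands of names, a library implementation of hierarchical clustering would be the more realistic choice.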