On the Origin of LLMs: An Evolutionary Tree and Graph for 15,821 Large Language Models

July 19, 2023
Authors: Sarah Gao, Andrew Kean Gao
cs.AI

Abstract

Since late 2022, Large Language Models (LLMs) have become very prominent, with LLMs such as ChatGPT and Bard attracting millions of users. Hundreds of new LLMs are announced each week, many of which are deposited on Hugging Face, a repository of machine learning models and datasets. To date, nearly 16,000 Text Generation models have been uploaded to the site. Given this huge influx of LLMs, it is of interest to know which LLM backbones, settings, training methods, and families are popular or trending. However, no comprehensive index of LLMs is available. We take advantage of the relatively systematic nomenclature of Hugging Face LLMs to perform hierarchical clustering and identify communities among LLMs using n-grams and term frequency-inverse document frequency (TF-IDF). Our methods successfully identify families of LLMs and accurately cluster LLMs into meaningful subgroups. We present a public web application for navigating and exploring Constellation, our atlas of 15,821 LLMs. Constellation rapidly generates a variety of visualizations, namely dendrograms, graphs, word clouds, and scatter plots. Constellation is available at the following link: https://constellation.sites.stanford.edu/.
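The clustering idea described in the abstract (n-grams over model names, TF-IDF weighting, then hierarchical clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the use of character trigrams, smoothed IDF, cosine distance, and average linkage are all assumptions made for illustration, and the function names (`char_ngrams`, `tfidf_vectors`, `agglomerate`) and sample model names are hypothetical. It uses only the Python standard library.

```python
import math
from collections import Counter
from itertools import combinations


def char_ngrams(name, n=3):
    """Return the overlapping character n-grams of a lowercased model name."""
    s = name.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def tfidf_vectors(names, n=3):
    """Build one L2-normalized sparse TF-IDF vector (a dict) per name."""
    docs = [Counter(char_ngrams(name, n)) for name in names]
    # Document frequency: in how many names each n-gram appears.
    df = Counter(gram for doc in docs for gram in doc)
    total = len(docs)
    vectors = []
    for doc in docs:
        # Smoothed IDF (an assumption): common n-grams keep a small weight.
        vec = {g: tf * (math.log((1 + total) / (1 + df[g])) + 1)
               for g, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """Cosine distance between two L2-normalized sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return 1.0 - sum(w * b.get(g, 0.0) for g, w in a.items())


def agglomerate(names, num_clusters, n=3):
    """Average-linkage agglomerative clustering over model names."""
    vecs = tfidf_vectors(names, n)
    dist = {frozenset((i, j)): cosine_distance(vecs[i], vecs[j])
            for i, j in combinations(range(len(names)), 2)}
    clusters = [[i] for i in range(len(names))]
    while len(clusters) > num_clusters:
        best = None
        # Find the pair of clusters with the smallest mean pairwise distance.
        for x, y in combinations(range(len(clusters)), 2):
            pairs = [dist[frozenset((i, j))]
                     for i in clusters[x] for j in clusters[y]]
            avg = sum(pairs) / len(pairs)
            if best is None or avg < best[0]:
                best = (avg, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe: y > x, so index x is unaffected
    return [sorted(names[i] for i in c) for c in clusters]


# Illustrative names written in the style of Hugging Face model names.
sample_names = ["llama-7b", "falcon-40b", "llama-13b",
                "falcon-rw-1b", "llama-7b-chat", "falcon-40b-instruct"]
for group in agglomerate(sample_names, num_clusters=2):
    print(group)
```

Because names from the same family share trigrams such as `lla` or `fal`, their vectors sit closer together than names from different families, which is what lets a naming convention alone recover family structure. The quadratic merge loop is fine for a handful of names but would need a library implementation to scale to thousands of models.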