Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face

Abstract

Many have observed that the development and deployment of generative machine learning (ML) and artificial intelligence (AI) models follow a distinctive pattern in which pre-trained models are adapted and fine-tuned for specific downstream tasks. However, little empirical work examines the structure of these interactions. This paper analyzes 1.86 million models on Hugging Face, a leading peer production platform for model development. Our study of model family trees (networks that connect fine-tuned models to their base or parent model) reveals sprawling fine-tuning lineages that vary widely in size and structure. Viewing ML models through an evolutionary biology lens, we use model metadata and model cards to measure genetic similarity and the mutation of traits across model families. We find that models tend to exhibit a family resemblance: their genetic markers and traits overlap more when they belong to the same model family. However, these similarities depart in certain ways from standard models of asexual reproduction, because mutations are fast and directed, such that two "sibling" models tend to be more similar to each other than a parent/child pair is. Further analysis of the directional drift of these mutations reveals qualitative insights about the open machine learning ecosystem: licenses counter-intuitively drift from restrictive, commercial licenses toward permissive or copyleft licenses, often in violation of upstream license terms; models evolve from multilingual compatibility toward English-only compatibility; and model cards become shorter and more standardized by turning, more often, to templates and automatically generated text. Overall, this work takes a step toward an empirically grounded understanding of model fine-tuning and suggests that ecological models and methods can yield novel scientific insights.
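As a rough illustration of the sibling versus parent/child comparison described in the abstract, the following minimal sketch builds a tiny family tree from toy records and compares mean trait overlap for the two kinds of pairs. The record layout, the model names, the trait tags, and the use of Jaccard similarity are all assumptions made for illustration; the abstract does not specify the similarity measure or data format used in the paper.

```python
from itertools import combinations

# Hypothetical toy records: model id -> (parent id or None, set of trait tags).
# Names and tags are invented for illustration only.
MODELS = {
    "base": (None, {"license:llama", "lang:en", "lang:fr", "lang:de"}),
    "ft-a": ("base", {"license:apache-2.0", "lang:en"}),
    "ft-b": ("base", {"license:apache-2.0", "lang:en", "task:chat"}),
}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def parent_child_pairs(models):
    """All (child, parent) pairs whose parent is present in the data."""
    return [(child, parent) for child, (parent, _) in models.items()
            if parent in models]


def sibling_pairs(models):
    """All unordered pairs of models that share the same parent."""
    children = {}
    for name, (parent, _) in models.items():
        if parent is not None:
            children.setdefault(parent, []).append(name)
    return [pair for kids in children.values()
            for pair in combinations(sorted(kids), 2)]


def mean_similarity(models, pairs):
    """Mean Jaccard similarity of trait sets over the given pairs."""
    scores = [jaccard(models[a][1], models[b][1]) for a, b in pairs]
    return sum(scores) / len(scores) if scores else float("nan")


parent_child_sim = mean_similarity(MODELS, parent_child_pairs(MODELS))
sibling_sim = mean_similarity(MODELS, sibling_pairs(MODELS))
print(f"parent/child: {parent_child_sim:.3f}  siblings: {sibling_sim:.3f}")
```

In this toy data both fine-tuned models move in the same direction (toward a permissive license and English only), so the siblings overlap more with each other than either does with the base model, which is the pattern the abstract calls fast, directed mutation.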
