

Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face

August 9, 2025
Authors: Benjamin Laufer, Hamidah Oderinwale, Jon Kleinberg
cs.AI

Abstract

Many have observed that the development and deployment of generative machine learning (ML) and artificial intelligence (AI) models follow a distinctive pattern in which pre-trained models are adapted and fine-tuned for specific downstream tasks. However, there is limited empirical work that examines the structure of these interactions. This paper analyzes 1.86 million models on Hugging Face, a leading peer production platform for model development. Our study of model family trees -- networks that connect fine-tuned models to their base or parent models -- reveals sprawling fine-tuning lineages that vary widely in size and structure. Using an evolutionary biology lens to study ML models, we use model metadata and model cards to measure the genetic similarity and mutation of traits over model families. We find that models tend to exhibit a family resemblance, meaning their genetic markers and traits exhibit more overlap when they belong to the same model family. However, these similarities depart in certain ways from standard models of asexual reproduction, because mutations are fast and directed, such that two 'sibling' models tend to exhibit more similarity than parent/child pairs. Further analysis of the directional drifts of these mutations reveals qualitative insights about the open machine learning ecosystem: licenses counter-intuitively drift from restrictive, commercial licenses toward permissive or copyleft licenses, often in violation of upstream licenses' terms; models evolve from multilingual compatibility toward English-only compatibility; and model cards shrink in length and standardize, turning more often to templates and automatically generated text. Overall, this work takes a step toward an empirically grounded understanding of model fine-tuning and suggests that ecological models and methods can yield novel scientific insights.
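The abstract's notion of genetic similarity over model families can be illustrated with a toy computation. The sketch below is an assumption-laden illustration, not the paper's actual method: model names and tag sets are hypothetical, and Jaccard overlap stands in for whatever similarity measure the authors use over metadata markers. It shows how two "sibling" fine-tunes with fast, directed mutations (both dropping a language, both adding an instruct tag) can end up more similar to each other than either is to the parent.

```python
# Toy family tree: each model records its parent (None for a base model)
# and a set of "genetic markers" (e.g., tags from its model card).
# All names and tags are hypothetical examples, not real Hugging Face models.
models = {
    "base-7b":      {"parent": None,      "tags": {"en", "fr", "apache-2.0", "text-generation"}},
    "base-7b-chat": {"parent": "base-7b", "tags": {"en", "apache-2.0", "text-generation", "chat", "instruct"}},
    "base-7b-code": {"parent": "base-7b", "tags": {"en", "apache-2.0", "text-generation", "code", "instruct"}},
}

def jaccard(a: set, b: set) -> float:
    """Overlap of two marker sets: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def similarity(m1: str, m2: str) -> float:
    return jaccard(models[m1]["tags"], models[m2]["tags"])

parent_child = similarity("base-7b", "base-7b-chat")   # 3 shared / 6 total = 0.5
siblings = similarity("base-7b-chat", "base-7b-code")  # 4 shared / 6 total ≈ 0.667
print(f"parent/child: {parent_child:.3f}, siblings: {siblings:.3f}")
```

Because both children mutated in the same direction (dropping "fr", gaining "instruct"), the sibling pair overlaps more than the parent/child pair, mirroring the departure from standard asexual-reproduction models described above.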