Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face

Abstract

Many have observed that the development and deployment of generative machine learning (ML) and artificial intelligence (AI) models follow a distinctive pattern in which pre-trained models are adapted and fine-tuned for specific downstream tasks. However, there is limited empirical work that examines the structure of these interactions. This paper analyzes 1.86 million models on Hugging Face, a leading peer production platform for model development. Our study of model family trees (networks that connect fine-tuned models to their base or parent models) reveals sprawling fine-tuning lineages that vary widely in size and structure. Using an evolutionary biology lens, we use model metadata and model cards to measure the genetic similarity and mutation of traits across model families. We find that models tend to exhibit a family resemblance: their genetic markers and traits overlap more when they belong to the same model family. However, these similarities depart in certain ways from standard models of asexual reproduction: because mutations are fast and directed, two "sibling" models tend to be more similar to each other than parent/child pairs are. Further analysis of the directional drift of these mutations yields qualitative insights about the open machine learning ecosystem: licenses counter-intuitively drift from restrictive, commercial licenses toward permissive or copyleft licenses, often in violation of upstream license terms; models evolve from multilingual compatibility toward English-only compatibility; and model cards become shorter and more standardized, relying more often on templates and automatically generated text. Overall, this work takes a step toward an empirically grounded understanding of model fine-tuning and suggests that ecological models and methods can yield novel scientific insights.
