Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face

Abstract

Many have observed that the development and deployment of generative machine learning (ML) and artificial intelligence (AI) models follow a distinctive pattern in which pre-trained models are adapted and fine-tuned for specific downstream tasks. However, little empirical work examines the structure of these interactions. This paper analyzes 1.86 million models on Hugging Face, a leading peer production platform for model development. Our study of model family trees (networks that connect fine-tuned models to their base or parent models) reveals sprawling fine-tuning lineages that vary widely in size and structure. Using an evolutionary biology lens to study ML models, we use model metadata and model cards to measure the genetic similarity and mutation of traits over model families. We find that models tend to exhibit a family resemblance: their genetic markers and traits overlap more when they belong to the same model family. However, these similarities depart in certain ways from standard models of asexual reproduction, because mutations are fast and directed, such that two 'sibling' models tend to exhibit more similarity than parent/child pairs. Further analysis of the directional drifts of these mutations reveals qualitative insights about the open machine learning ecosystem: licenses counterintuitively drift from restrictive, commercial licenses toward permissive or copyleft licenses, often in violation of upstream licenses' terms; models evolve from multilingual compatibility toward English-only compatibility; and model cards shrink in length and standardize by turning more often to templates and automatically generated text. Overall, this work takes a step toward an empirically grounded understanding of model fine-tuning and suggests that ecological models and methods can yield novel scientific insights.
