SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
February 4, 2025
作者: Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf
cs.AI
Abstract
While large language models have facilitated breakthroughs in many
applications of artificial intelligence, their inherent largeness makes them
computationally expensive and challenging to deploy in resource-constrained
settings. In this paper, we document the development of SmolLM2, a
state-of-the-art "small" (1.7 billion parameter) language model (LM). To attain
strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a
multi-stage training process that mixes web text with specialized math, code,
and instruction-following data. We additionally introduce new specialized
datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing
datasets to be problematically small or low-quality. To inform our design
decisions, we perform both small-scale ablations as well as a manual refinement
process that updates the dataset mixing rates at each stage based on the
performance at the previous stage. Ultimately, we demonstrate that SmolLM2
outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To
facilitate future research on LM development as well as applications of small
LMs, we release both SmolLM2 as well as all of the datasets we prepared in the
course of this project.
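To make the stage-wise recipe concrete, the sketch below shows how mixing rates over broad data sources might be applied during a stage and then manually refined before the next stage based on the previous stage's evaluations. The source categories, rates, stage count, and the `refine_mix` heuristic are illustrative assumptions, not the paper's actual configuration.

```python
# A minimal, hypothetical sketch of stage-wise dataset mixing with manual
# refinement between stages. The source categories, mixing rates, number of
# stages, and the adjustment heuristic are illustrative assumptions, not the
# recipe described in the paper.
import random

def sample_source(mix: dict[str, float]) -> str:
    """Pick which data source the next training batch is drawn from."""
    sources, weights = zip(*mix.items())
    return random.choices(sources, weights=weights, k=1)[0]

def refine_mix(mix: dict[str, float], eval_scores: dict[str, float]) -> dict[str, float]:
    """Toy stand-in for manual refinement: upweight sources whose benchmark
    scores lag (closer to 0), then renormalize so the rates sum to 1."""
    reweighted = {src: rate * (2.0 - eval_scores.get(src, 1.0)) for src, rate in mix.items()}
    total = sum(reweighted.values())
    return {src: rate / total for src, rate in reweighted.items()}

if __name__ == "__main__":
    # Hypothetical starting rates: mostly web text, small slices of code and math.
    mix = {"web": 0.90, "code": 0.05, "math": 0.05}
    for stage in range(3):
        print(f"stage {stage} mix:", {src: round(rate, 2) for src, rate in mix.items()})
        # ... train on this stage's token budget, drawing batches via sample_source(mix) ...
        eval_scores = {"web": 0.80, "code": 0.60, "math": 0.50}  # placeholder scores in [0, 1]
        mix = refine_mix(mix, eval_scores)
```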