SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
February 4, 2025
作者: Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf
cs.AI
Abstract
While large language models have facilitated breakthroughs in many
applications of artificial intelligence, their inherent largeness makes them
computationally expensive and challenging to deploy in resource-constrained
settings. In this paper, we document the development of SmolLM2, a
state-of-the-art "small" (1.7 billion parameter) language model (LM). To attain
strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a
multi-stage training process that mixes web text with specialized math, code,
and instruction-following data. We additionally introduce new specialized
datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing
datasets to be problematically small or low-quality. To inform our design
decisions, we perform both small-scale ablations as well as a manual refinement
process that updates the dataset mixing rates at each stage based on the
performance at the previous stage. Ultimately, we demonstrate that SmolLM2
outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To
facilitate future research on LM development as well as applications of small
LMs, we release both SmolLM2 as well as all of the datasets we prepared in the
course of this project.
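To make the stage-wise recipe concrete, the sketch below shows how mixing rates over broad data sources might be applied during a stage and then manually refined before the next stage based on the previous stage's evaluations. The source categories, rates, stage count, and the `refine_mix` heuristic are illustrative assumptions, not the paper's actual configuration.

```python
# A minimal, hypothetical sketch of stage-wise dataset mixing with manual
# refinement between stages. The source categories, mixing rates, number of
# stages, and the adjustment heuristic are illustrative assumptions, not the
# recipe described in the paper.
import random

def sample_source(mix: dict[str, float]) -> str:
    """Pick which data source the next training batch is drawn from."""
    sources, weights = zip(*mix.items())
    return random.choices(sources, weights=weights, k=1)[0]

def refine_mix(mix: dict[str, float], eval_scores: dict[str, float]) -> dict[str, float]:
    """Toy stand-in for manual refinement: upweight sources whose benchmark
    scores lag (closer to 0), then renormalize so the rates sum to 1."""
    reweighted = {src: rate * (2.0 - eval_scores.get(src, 1.0)) for src, rate in mix.items()}
    total = sum(reweighted.values())
    return {src: rate / total for src, rate in reweighted.items()}

if __name__ == "__main__":
    # Hypothetical starting rates: mostly web text, small slices of code and math.
    mix = {"web": 0.90, "code": 0.05, "math": 0.05}
    for stage in range(3):
        print(f"stage {stage} mix:", {src: round(rate, 2) for src, rate in mix.items()})
        # ... train on this stage's token budget, drawing batches via sample_source(mix) ...
        eval_scores = {"web": 0.80, "code": 0.60, "math": 0.50}  # placeholder scores in [0, 1]
        mix = refine_mix(mix, eval_scores)
```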