Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
July 23, 2024
Authors: Jonathan Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith
cs.AI
Abstract
The pretraining data of today's strongest language models is opaque. In
particular, little is known about the proportions of various domains or
languages represented. In this work, we tackle a task which we call data
mixture inference, which aims to uncover the distributional make-up of training
data. We introduce a novel attack based on a previously overlooked source of
information -- byte-pair encoding (BPE) tokenizers, used by the vast majority
of modern language models. Our key insight is that the ordered list of merge
rules learned by a BPE tokenizer naturally reveals information about the token
frequencies in its training data: the first merge is the most common byte pair,
the second is the most common pair after merging the first token, and so on.
Given a tokenizer's merge list along with data samples for each category of
interest, we formulate a linear program that solves for the proportion of each
category in the tokenizer's training set. Importantly, to the extent to which
tokenizer training data is representative of the pretraining data, we
indirectly learn about the pretraining data. In controlled experiments, we show
that our attack recovers mixture ratios with high precision for tokenizers
trained on known mixtures of natural languages, programming languages, and data
sources. We then apply our approach to off-the-shelf tokenizers released with
recent LMs. We confirm much publicly disclosed information about these models,
and also make several new inferences: GPT-4o's tokenizer is much more
multilingual than its predecessors, training on 39% non-English data; Llama3
extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and
Claude's tokenizers are trained on predominantly code (~60%). We hope our work
sheds light on current design practices for pretraining data, and inspires
continued research into data mixture inference for LMs.
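
To make the mechanism described above concrete, the following is a minimal, self-contained sketch (in Python, using SciPy's linprog) of how an ordered merge list constrains mixture weights. It is an illustration under simplifying assumptions, not the authors' released attack: the toy corpora, the toy BPE trainer, and every function name below are invented for the example, and real tokenizers and data would require the more careful formulation described in the paper.

from collections import Counter

import numpy as np
from scipy.optimize import linprog


def pair_counts(tokens):
    """Count adjacent token pairs in one token sequence."""
    return Counter(zip(tokens, tokens[1:]))


def apply_merge(tokens, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe_on_mixture(corpora, weights, num_merges):
    """Toy stand-in for BPE training on a mixed corpus: at each step, merge the
    pair with the highest mixture-weighted count, then re-count (as the abstract
    describes: the second merge is the most common pair *after* the first)."""
    seqs = [list(c) for c in corpora]
    merges = []
    for _ in range(num_merges):
        total = Counter()
        for w, s in zip(weights, seqs):
            for pair, n in pair_counts(s).items():
                total[pair] += w * n
        best = max(total, key=total.get)
        merges.append(best)
        seqs = [apply_merge(s, best) for s in seqs]
    return merges


def merge_constraints(corpora, merge_list):
    """One LP row per (step, runner-up pair): the observed merge must have had
    at least as high a mixture-weighted count as every competing pair."""
    seqs = [list(c) for c in corpora]
    rows = []
    for winner in merge_list:
        counts = [pair_counts(s) for s in seqs]
        for q in set().union(*counts):
            if q != winner:
                # sum_i alpha_i * (count_i(q) - count_i(winner)) <= 0
                rows.append([c[q] - c[winner] for c in counts])
        seqs = [apply_merge(s, winner) for s in seqs]
    return np.array(rows, dtype=float)


def weight_bounds(corpora, merge_list, category=0):
    """Smallest and largest feasible mixture weight for one category, subject to
    the merge-order constraints, alpha >= 0, and sum(alpha) == 1.  Real data
    would additionally need slack terms, since category samples only
    approximate the tokenizer's actual training data."""
    n = len(corpora)
    A_ub = merge_constraints(corpora, merge_list)
    lp_args = dict(A_ub=A_ub, b_ub=np.zeros(len(A_ub)),
                   A_eq=np.ones((1, n)), b_eq=[1.0], bounds=[(0.0, 1.0)] * n)
    c = np.zeros(n)
    c[category] = 1.0
    lo, hi = linprog(c, **lp_args), linprog(-c, **lp_args)
    assert lo.success and hi.success, "merge-order constraints are infeasible"
    return lo.fun, -hi.fun


if __name__ == "__main__":
    # Two made-up "categories" with different character statistics.
    english = list("the cat sat on the mat. " * 50)
    code = list("for i in range(10): print(i)\n" * 50)
    # Simulate a tokenizer trained on a 30% English / 70% code mixture, then
    # try to recover that ratio from the merge list alone.
    merges = train_bpe_on_mixture([english, code], [0.3, 0.7], num_merges=15)
    lo, hi = weight_bounds([english, code], merges, category=0)
    print(f"English share constrained to [{lo:.2f}, {hi:.2f}]")  # contains 0.30

Reporting an interval rather than a point estimate reflects that a short merge list only pins the weights down to a feasible region; in this toy setting, longer merge lists and more contrastive categories shrink that region.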