Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
July 23, 2024
Authors: Jonathan Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith
cs.AI
Abstract
The pretraining data of today's strongest language models is opaque. In
particular, little is known about the proportions of various domains or
languages represented. In this work, we tackle a task which we call data
mixture inference, which aims to uncover the distributional make-up of training
data. We introduce a novel attack based on a previously overlooked source of
information -- byte-pair encoding (BPE) tokenizers, used by the vast majority
of modern language models. Our key insight is that the ordered list of merge
rules learned by a BPE tokenizer naturally reveals information about the token
frequencies in its training data: the first merge is the most common byte pair,
the second is the most common pair after merging the first token, and so on.
Given a tokenizer's merge list along with data samples for each category of
interest, we formulate a linear program that solves for the proportion of each
category in the tokenizer's training set. Importantly, to the extent to which
tokenizer training data is representative of the pretraining data, we
indirectly learn about the pretraining data. In controlled experiments, we show
that our attack recovers mixture ratios with high precision for tokenizers
trained on known mixtures of natural languages, programming languages, and data
sources. We then apply our approach to off-the-shelf tokenizers released with
recent LMs. We confirm much publicly disclosed information about these models,
and also make several new inferences: GPT-4o's tokenizer is much more
multilingual than its predecessors, training on 39% non-English data; Llama3
extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and
Claude's tokenizers are trained on predominantly code (~60%). We hope our work
sheds light on current design practices for pretraining data, and inspires
continued research into data mixture inference for LMs.
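To make the linear-program idea concrete, here is a minimal sketch, not the authors' released implementation, of how mixture proportions might be recovered from a merge list. It assumes hypothetical inputs: `merge_list[t]` is the byte pair the tokenizer chose at step t, and `pair_counts[t][cat]` maps candidate pairs to their counts in the category-`cat` sample after the first t merges have been applied. The sketch encodes, for each step, the constraint that the chosen pair must have been at least as frequent under the mixture as every competing pair, with nonnegative slack variables absorbing sampling noise, and solves the resulting LP with `scipy.optimize.linprog`.

```python
import numpy as np
from scipy.optimize import linprog


def infer_mixture(merge_list, pair_counts, categories):
    """Estimate mixture proportions over `categories` from a BPE merge list.

    For each merge step t, the pair actually chosen must have been at least as
    frequent (under the unknown mixture) as every competing pair at that step,
    up to a nonnegative slack s_t. We minimize total slack subject to
    alpha >= 0 and sum(alpha) = 1.
    """
    n, T = len(categories), len(merge_list)
    # Variables: [alpha_1 .. alpha_n, s_1 .. s_T]
    A_ub, b_ub = [], []
    for t, chosen in enumerate(merge_list):
        candidates = set().union(*(pair_counts[t][c].keys() for c in categories))
        for pair in candidates:
            if pair == chosen:
                continue
            # sum_i alpha_i * (count_i(pair) - count_i(chosen)) - s_t <= 0
            row = np.zeros(n + T)
            for i, cat in enumerate(categories):
                counts = pair_counts[t][cat]
                row[i] = counts.get(pair, 0) - counts.get(chosen, 0)
            row[n + t] = -1.0
            A_ub.append(row)
            b_ub.append(0.0)
    # Proportions sum to 1; slacks are nonnegative.
    A_eq = [np.concatenate([np.ones(n), np.zeros(T)])]
    b_eq = [1.0]
    c = np.concatenate([np.zeros(n), np.ones(T)])  # objective: total slack
    bounds = [(0, 1)] * n + [(0, None)] * T
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    return dict(zip(categories, res.x[:n]))
```

This is a simplified illustration under the stated assumptions; the paper's full formulation involves additional modeling choices (e.g., how candidate pairs and counts are tracked as merges are applied) that the sketch does not attempt to reproduce.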