Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

Abstract

The pretraining data of today's strongest language models is opaque. In particular, little is known about the proportions in which various domains or languages are represented. In this work, we tackle a task that we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information: byte-pair encoding (BPE) tokenizers, which are used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data: the first merge is the most common byte pair, the second is the most common pair after merging the first token, and so on. Given a tokenizer's merge list along with data samples for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. Importantly, to the extent that the tokenizer training data is representative of the pretraining data, we indirectly learn about the pretraining data. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: GPT-4o's tokenizer is much more multilingual than its predecessors, having been trained on 39% non-English data; Llama3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and Claude's tokenizers are trained on predominantly code (~60%). We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.
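To make the key insight concrete, the following is a minimal toy sketch, not the paper's implementation. It trains a character-level BPE on a mixture of two hypothetical categories and shows that the merge order changes with the mixture. It then recovers the set of mixture weights consistent with an observed merge list by checking, at every merge step, that the chosen pair has the highest mixed count. These are the kind of inequalities a linear program could encode; here a brute-force grid search stands in for the LP. All names (`train_bpe`, `feasible_alphas`, the two toy samples) are illustrative assumptions, and the samples are assumed to be equally sized so that raw counts can be mixed directly.

```python
from collections import Counter

def to_words(text):
    """Split text on whitespace; each word becomes a tuple of characters."""
    return Counter(tuple(w) for w in text.split())

def pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def apply_merge(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = Counter()
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

def train_bpe(words, num_merges):
    """Greedy BPE: repeatedly merge the currently most frequent pair."""
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(words)
        if not counts:
            break
        # Ties are broken lexicographically to keep the sketch deterministic.
        best = max(sorted(counts), key=counts.get)
        merges.append(best)
        words = apply_merge(words, best)
    return merges

def mix(sample_a, sample_b, weight_a, weight_b):
    """Build a training corpus from weighted copies of two category samples."""
    mixed = Counter()
    for word, freq in sample_a.items():
        mixed[word] += weight_a * freq
    for word, freq in sample_b.items():
        mixed[word] += weight_b * freq
    return mixed

def feasible_alphas(merges, sample_a, sample_b, grid=11):
    """Grid-search stand-in for an LP: keep each alpha (share of category A)
    for which every observed merge is a most frequent pair at its step."""
    feasible = []
    for k in range(grid):
        alpha = k / (grid - 1)
        words_a, words_b = sample_a, sample_b
        consistent = True
        for merge in merges:
            ca, cb = pair_counts(words_a), pair_counts(words_b)

            def mixed_count(p):
                return alpha * ca[p] + (1 - alpha) * cb[p]

            target = mixed_count(merge)
            if any(mixed_count(p) > target + 1e-12 for p in set(ca) | set(cb)):
                consistent = False
                break
            # Apply the observed merge to both samples before the next step.
            words_a = apply_merge(words_a, merge)
            words_b = apply_merge(words_b, merge)
        if consistent:
            feasible.append(alpha)
    return feasible

# Two hypothetical, equally sized category samples.
sample_a = to_words("in in in it")   # stands in for natural language
sample_b = to_words("fn fn fn if")   # stands in for code

merges_mostly_a = train_bpe(mix(sample_a, sample_b, 2, 1), 3)
merges_mostly_b = train_bpe(mix(sample_a, sample_b, 1, 2), 3)
alphas = feasible_alphas(merges_mostly_a, sample_a, sample_b)
print(merges_mostly_a, merges_mostly_b, alphas)
```

In this toy setting, the corpus built with a 2:1 weighting merges `('i', 'n')` first, while the 1:2 corpus merges `('f', 'n')` first. The grid search keeps only weights between 0.5 and 0.7 for the first merge list, a range that contains the true share of 2/3. With realistic data, many more merges and categories tighten these constraints, which is where an actual LP solver would replace the grid search.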
