Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

Abstract

The pretraining data of today's strongest language models is opaque. In particular, little is known about the proportions of the various domains or languages represented. In this work, we tackle a task that we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information -- byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data: the first merge is the most common byte pair, the second is the most common pair after merging the first token, and so on. Given a tokenizer's merge list along with data samples for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. Importantly, to the extent that tokenizer training data is representative of the pretraining data, we indirectly learn about the pretraining data. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: GPT-4o's tokenizer is much more multilingual than its predecessors, trained on 39% non-English data; Llama3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and Claude's tokenizers are trained predominantly on code (~60%). We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.
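To make the key insight concrete, the following is a minimal sketch, not the implementation used in this work. It assumes character-level symbols instead of bytes, whitespace-split words, and only two categories, and it replaces the linear program with a coarse grid search over a single mixture weight. All function names (`pair_counts`, `apply_merge`, `train_bpe`, `mix_samples`, `consistent_alphas`) are hypothetical and chosen for illustration.

```python
from collections import Counter


def to_words(text):
    """Split on whitespace; each word becomes a tuple of single-character symbols."""
    return Counter(tuple(w) for w in text.split())


def pair_counts(words):
    """Count adjacent symbol pairs over a {symbol_tuple: frequency} mapping."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def apply_merge(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = Counter()
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged


def train_bpe(words, num_merges):
    """Greedy BPE training: each merge is a most frequent pair at that step."""
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(words)
        if not counts:
            break
        # Ties are broken deterministically by the pair itself (an assumption).
        best = max(counts, key=lambda p: (counts[p], p))
        merges.append(best)
        words = apply_merge(words, best)
    return merges


def mix_samples(sample_a, sample_b, copies_a, copies_b):
    """Build a training set from `copies_a` copies of A and `copies_b` copies of B."""
    mix = Counter()
    for w, f in sample_a.items():
        mix[w] += copies_a * f
    for w, f in sample_b.items():
        mix[w] += copies_b * f
    return mix


def consistent_alphas(merges, sample_a, sample_b, grid=101):
    """Grid values of alpha (weight on A) under which every observed merge is a
    most frequent pair in the alpha-weighted pair counts at its step."""
    feasible = []
    for k in range(grid):
        alpha = k / (grid - 1)
        wa, wb = sample_a, sample_b
        ok = True
        for m in merges:
            ca, cb = pair_counts(wa), pair_counts(wb)

            def score(p):
                return alpha * ca[p] + (1 - alpha) * cb[p]

            if any(score(p) > score(m) + 1e-9 for p in set(ca) | set(cb)):
                ok = False
                break
            wa, wb = apply_merge(wa, m), apply_merge(wb, m)
        if ok:
            feasible.append(alpha)
    return feasible


if __name__ == "__main__":
    sample_a = to_words("the cat sat on the mat the end")
    sample_b = to_words("for i in range n print i for")
    merges = train_bpe(mix_samples(sample_a, sample_b, 3, 1), 10)
    print(merges)
    print(consistent_alphas(merges, sample_a, sample_b))
```

Each observed merge contributes one family of inequalities: at its step, the weighted count of the chosen pair must be at least that of every other pair. The grid search simply tests which weights satisfy all of them. With many categories and thousands of merges, a linear program over the category proportions is the natural replacement for this brute-force check.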
