Characterizing Narrative Content in Web-Scale LLM Pretraining Data

Abstract

The narrative composition of web-scale LLM pretraining corpora remains largely unexplored, even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events), operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we fine-tune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, yielding a new dataset, NarraDolma. We find that (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) a continuous, multidimensional narrative structure underlies web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.