Characterizing Narrative Content in Web-Scale LLM Pretraining Data

Abstract

The narrative composition of web-scale LLM pretraining corpora remains largely unexplored, even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events), operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we fine-tune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We then apply NarraBERT to 3M passages, yielding a new dataset, NarraDolma. We find that (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) a continuous, multidimensional narrative structure underlies web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects performance on narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.