Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability

Abstract

Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to train these models, or to support their work at inference time, have a direct impact on their quality. The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data and revealed an urgent need to ground the stewardship of these datasets in sustainable practices with clear provenance chains. To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively documented dataset of historic texts. This analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available. This report describes this project's goals and methods as well as the results of the analyses we performed, all in service of making this historical collection more accessible and easier for humans and machines alike to filter, read, and use.
