Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability

Abstract

Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to train these models, or to support their work at inference time, have a direct impact on their quality. The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data and revealed an urgent need to ground the stewardship of these datasets in sustainable practices with clear provenance chains. To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively documented dataset of historic texts. This analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available. This report describes this project's goals and methods as well as the results of the analyses we performed, all in service of making this historical collection more accessible and easier for humans and machines alike to filter, read, and use.

