Pretraining Language Models for Diachronic Linguistic Change Discovery

Abstract

Large language models (LLMs) have shown potential as tools for scientific discovery, which has prompted growing interest in their use in humanistic disciplines such as historical linguistics and literary studies. These fields often construct arguments on the basis of delineations like genre or, more inflexibly, time period. Although efforts have been made to restrict inference to specific domains via fine-tuning or model editing, we posit that the only true guarantee is domain-restricted pretraining, which is typically a data- and compute-expensive proposition. We show that efficient pretraining techniques can produce useful models over corpora too large for easy manual inspection but too small for "typical" LLM approaches. We employ a novel date-attribution pipeline to obtain a temporally segmented dataset of five 10-million-word slices. Over these corpus segments we train two corresponding five-model batteries: one efficiently pretrained and one parameter-efficiently fine-tuned from Llama3-8B. We find that the pretrained models are faster to train than the fine-tuned baselines and that they better respect the historical divisions of our corpus. Emphasizing speed and precision over ahistorical comprehensiveness enables a number of novel approaches to hypothesis discovery and testing in our target fields. Taking up diachronic linguistics as a testbed, we show that our method enables the detection of a diverse set of phenomena, including en masse lexical change, non-lexical (grammatical and morphological) change, and word sense introduction and obsolescence. We provide a ready-to-use pipeline that allows extension of our approach to other target fields with only minimal adaptation.
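As a rough illustration of what a temporally segmented dataset of five 10-million-word slices could look like in practice, the following is a minimal Python sketch. It is not the authors' pipeline: it assumes each document already carries an attributed year (the date-attribution step itself is not shown), and the period boundaries, the function names `assign_period` and `build_slices`, and the whitespace-based word count are all hypothetical choices made for the example.

```python
from collections import defaultdict

# Hypothetical period boundaries (start year inclusive, end year exclusive).
# The abstract does not state the actual cut points.
PERIODS = [(1750, 1800), (1800, 1850), (1850, 1900), (1900, 1950), (1950, 2000)]
WORD_BUDGET = 10_000_000  # target slice size mentioned in the abstract


def assign_period(year, periods=PERIODS):
    """Return the index of the period containing `year`, or None if out of range."""
    for i, (start, end) in enumerate(periods):
        if start <= year < end:
            return i
    return None


def build_slices(documents, periods=PERIODS, word_budget=WORD_BUDGET):
    """Group (year, text) pairs into per-period slices capped at `word_budget` words.

    Words are counted by whitespace splitting (an assumption). A document that
    would push a slice over its budget is skipped rather than truncated.
    """
    slices = defaultdict(list)
    counts = defaultdict(int)
    for year, text in documents:
        idx = assign_period(year, periods)
        if idx is None:
            continue  # documents outside every period are dropped
        n_words = len(text.split())
        if counts[idx] + n_words > word_budget:
            continue
        slices[idx].append(text)
        counts[idx] += n_words
    return dict(slices), dict(counts)
```

Capping each slice at the same word budget keeps the per-period models comparable in training data size, so that differences between them are less likely to reflect corpus size alone.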
