UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages

Abstract

Large language models (LLMs) underperform on low-resource languages because of limited training data. We present a method to efficiently collect text data for low-resource languages from the entire Common Crawl corpus. Our approach, UnifiedCrawl, filters and extracts Common Crawl using minimal compute resources, yielding monolingual datasets much larger than previously available sources. We demonstrate that using this data to fine-tune multilingual LLMs with an efficient adapter method (QLoRA) significantly boosts performance on the low-resource language while minimizing VRAM usage. Our experiments show large improvements in language modeling perplexity and an increase in few-shot prompting scores. Our work and released source code provide an affordable approach to improving LLMs for low-resource languages using consumer hardware. Our source code is available at https://github.com/bethelmelesse/unifiedcrawl.

