Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages

Abstract

Low-resource languages (LRLs) face significant challenges in natural language processing (NLP) due to limited data. While current state-of-the-art large language models (LLMs) still struggle with LRLs, smaller multilingual models (mLMs) such as mBERT and XLM-R are more promising because their capacity is better matched to small training sets. This study systematically investigates parameter-efficient adapter-based methods for adapting mLMs to LRLs, evaluating three architectures: Sequential Bottleneck, Invertible Bottleneck, and Low-Rank Adaptation. Using unstructured text from GlotCC and structured knowledge from ConceptNet, we show that small adaptation datasets (e.g., up to 1 GB of free text or a few MB of knowledge graph data) yield gains on an intrinsic task (masked language modeling) and on extrinsic tasks (topic classification, sentiment analysis, and named entity recognition). We find that Sequential Bottleneck adapters excel in language modeling, while Invertible Bottleneck adapters slightly outperform other methods on downstream tasks due to better embedding alignment and larger parameter counts. Adapter-based methods match or outperform full fine-tuning while using far fewer parameters, and smaller mLMs prove more effective for LRLs than massive LLMs such as LLaMA-3, GPT-4, and DeepSeek-R1-based distilled models. While adaptation improves performance, pre-training data size remains the dominant factor, especially for languages with extensive pre-training coverage.
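To give a rough sense of what "far fewer parameters" means, the sketch below counts the trainable parameters of a single bottleneck adapter block and a single low-rank (LoRA-style) update, and compares them with one full dense layer of the same width. The hidden size, bottleneck width, and rank are illustrative assumptions, not the configurations used in this study, and the function names are hypothetical. The Invertible Bottleneck variant, which the abstract notes has a larger parameter count, is not modeled here.

```python
# Illustrative parameter counts only; all sizes below are assumptions,
# not values reported in the paper.

def bottleneck_adapter_params(hidden: int, bottleneck: int) -> int:
    """Down-projection (hidden -> bottleneck) plus up-projection
    (bottleneck -> hidden), each with a bias vector."""
    down = hidden * bottleneck + bottleneck
    up = bottleneck * hidden + hidden
    return down + up


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Two low-rank factors, A (rank x d_in) and B (d_out x rank), no bias."""
    return rank * d_in + d_out * rank


def dense_params(d_in: int, d_out: int) -> int:
    """A full weight matrix plus bias, as updated by full fine-tuning."""
    return d_in * d_out + d_out


HIDDEN = 768      # assumed hidden size (base-sized encoder)
BOTTLENECK = 48   # assumed bottleneck width (reduction factor 16)
RANK = 8          # assumed LoRA rank

adapter = bottleneck_adapter_params(HIDDEN, BOTTLENECK)
lora = lora_params(HIDDEN, HIDDEN, RANK)
dense = dense_params(HIDDEN, HIDDEN)

print(f"bottleneck adapter block: {adapter} parameters")
print(f"LoRA update (one matrix): {lora} parameters")
print(f"full dense layer:         {dense} parameters")
```

Under these assumed sizes, both adapter variants train a small fraction of the parameters of the single dense layer they sit beside, which is the sense in which the methods above are parameter-efficient.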
