Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages
February 14, 2025
Authors: Daniil Gurgurov, Ivan Vykopal, Josef van Genabith, Simon Ostermann
cs.AI
Abstract
Low-resource languages (LRLs) face significant challenges in natural language
processing (NLP) due to limited data. While current state-of-the-art large
language models (LLMs) still struggle with LRLs, smaller multilingual models
(mLMs) such as mBERT and XLM-R offer greater promise due to a better fit of
their capacity to low training data sizes. This study systematically
investigates parameter-efficient adapter-based methods for adapting mLMs to
LRLs, evaluating three architectures: Sequential Bottleneck, Invertible
Bottleneck, and Low-Rank Adaptation. Using unstructured text from GlotCC and
structured knowledge from ConceptNet, we show that small adaptation datasets
(e.g., up to 1 GB of free-text or a few MB of knowledge graph data) yield gains
in intrinsic (masked language modeling) and extrinsic tasks (topic
classification, sentiment analysis, and named entity recognition). We find that
Sequential Bottleneck adapters excel in language modeling, while Invertible
Bottleneck adapters slightly outperform other methods on downstream tasks due
to better embedding alignment and larger parameter counts. Adapter-based
methods match or outperform full fine-tuning while using far fewer parameters,
and smaller mLMs prove more effective for LRLs than massive LLMs like LLaMA-3,
GPT-4, and DeepSeek-R1-based distilled models. While adaptation improves
performance, pre-training data size remains the dominant factor, especially for
languages with extensive pre-training coverage.
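As an illustration of the adapter-based adaptation described above, the sketch below shows how a single language adapter (here the Invertible Bottleneck variant) could be attached to XLM-R and trained with masked language modeling on a small free-text corpus, using the Hugging Face `adapters` library. This is a minimal sketch under assumptions: the model checkpoint, adapter name, corpus file, and hyperparameters are illustrative, not the authors' exact configuration, and `SeqBnConfig` or `LoRAConfig` could be swapped in for the other two architectures studied.

```python
# Illustrative sketch (not the authors' released code): attach a language adapter
# to XLM-R and train it with masked language modeling on a small LRL corpus.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)
from datasets import load_dataset
import adapters
from adapters import AdapterTrainer, SeqBnInvConfig  # also available: SeqBnConfig, LoRAConfig

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Make the model adapter-compatible and add one language adapter.
adapters.init(model)
model.add_adapter("lrl_adapter", config=SeqBnInvConfig())  # Invertible Bottleneck variant
model.train_adapter("lrl_adapter")  # freeze base weights; only adapter parameters are updated

# Hypothetical plain-text corpus for the target low-resource language
# (e.g., text extracted from GlotCC); the file name is an assumption.
raw = load_dataset("text", data_files={"train": "lrl_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
    remove_columns=["text"],
)

trainer = AdapterTrainer(
    model=model,
    args=TrainingArguments(
        output_dir="lrl_adapter_out",
        num_train_epochs=1,
        per_device_train_batch_size=16,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
model.save_adapter("lrl_adapter_out", "lrl_adapter")
```

Because `train_adapter` freezes the base model, only the small adapter module is trained and stored, which is what keeps this family of methods parameter-efficient relative to full fine-tuning.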