Dynamic data sampler for cross-language transfer learning in large language models

May 17, 2024
Authors: Yudong Li, Yuhao Feng, Wen Zhou, Zhe Zhao, Linlin Shen, Cheng Hou, Xianxu Hou
cs.AI

Abstract

Large Language Models (LLMs) have gained significant attention in the field of natural language processing (NLP) due to their wide range of applications. However, training LLMs for languages other than English poses significant challenges, due to the difficulty of acquiring large-scale corpora and the requisite computing resources. In this paper, we propose ChatFlow, a cross-language transfer-based LLM, to address these challenges and train large Chinese language models in a cost-effective manner. We employ a mix of Chinese, English, and parallel corpora to continuously train the LLaMA2 model, aiming to align cross-language representations and facilitate knowledge transfer to the Chinese language model. In addition, we use a dynamic data sampler to progressively transition the model from unsupervised pre-training to supervised fine-tuning. Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance. We evaluate ChatFlow on popular Chinese and English benchmarks; the results indicate that it outperforms other Chinese models post-trained on LLaMA-2-7B.
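
The abstract describes the dynamic data sampler only at a high level. As a rough illustration, the Python sketch below ramps the probability of drawing supervised fine-tuning (SFT) examples over the course of training; the class name `DynamicDataSampler`, the linear schedule, and the placeholder data are assumptions made for illustration, not the authors' implementation.

```python
import random

class DynamicDataSampler:
    """Toy sketch: gradually shift sampling from unsupervised pre-training
    data toward supervised fine-tuning data as training progresses.
    The linear schedule here is an illustrative assumption."""

    def __init__(self, pretrain_data, sft_data, total_steps):
        self.pretrain_data = pretrain_data  # unsupervised corpus (e.g. raw zh/en text)
        self.sft_data = sft_data            # supervised instruction-response pairs
        self.total_steps = total_steps
        self.step = 0

    def sft_ratio(self):
        # Probability of drawing an SFT example, ramped linearly from 0 to 1.
        return min(self.step / self.total_steps, 1.0)

    def sample(self):
        self.step += 1
        if random.random() < self.sft_ratio():
            return random.choice(self.sft_data)
        return random.choice(self.pretrain_data)


# Usage: early steps draw mostly unsupervised text, later steps mostly SFT pairs.
sampler = DynamicDataSampler(
    pretrain_data=["<raw Chinese/English/parallel text>"],
    sft_data=["<instruction-response pair>"],
    total_steps=10_000,
)
batch = [sampler.sample() for _ in range(8)]
```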