Facilitating large language model Russian adaptation with Learned Embedding Propagation

December 30, 2024
Authors: Mikhail Tikhomirov, Daniil Chernyshev
cs.AI

Abstract

Rapid advancements in large language model (LLM) technology have led to the introduction of powerful open-source instruction-tuned LLMs whose text generation quality matches that of state-of-the-art counterparts such as GPT-4. While the emergence of such models accelerates the adoption of LLM technologies in sensitive-information environments, their authors do not disclose the training data necessary to replicate the results, making the achievements model-exclusive. Since these open-source models are also multilingual, the benefits of training a language-specific LLM are reduced: improved inference efficiency becomes the only guaranteed advantage of such a costly procedure. More cost-efficient options, such as vocabulary extension and subsequent continued pre-training, are also inhibited by the lack of access to high-quality instruction-tuning data, which is the major factor behind the resulting LLM's task-solving capabilities. To address these limitations and cut the cost of the language-adaptation pipeline, we propose Learned Embedding Propagation (LEP). Unlike existing approaches, our method requires less training data because it has minimal impact on existing LLM knowledge, which we reinforce with a novel ad-hoc embedding propagation procedure that allows skipping the instruction-tuning step and instead implants the new language knowledge directly into any existing instruction-tuned variant. We evaluated four Russian vocabulary adaptations of LLaMa-3-8B and Mistral-7B, showing that LEP is competitive with traditional instruction-tuning methods, achieving performance comparable to OpenChat 3.5 and LLaMa-3-8B-Instruct, with further gains in task-solving capability from self-calibration and continued tuning.
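The abstract describes the core mechanic only at a high level: new-language embeddings are learned on a vocabulary-extended base model and then propagated into an instruction-tuned checkpoint, skipping instruction tuning itself. As a rough illustration, the minimal Python sketch below (using Hugging Face transformers) shows one plausible way such a swap could look. The checkpoint path `path/to/russian-adapted-base`, the assumption that the extended tokenizer appends its new tokens after the original vocabulary, and the copy-plus-offset propagation rule are all illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch: propagate embeddings learned on a vocabulary-extended,
# continually pre-trained base model into an instruction-tuned checkpoint.
# Paths and the propagation rule are illustrative, not the paper's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Meta-Llama-3-8B"               # original base model
ADAPTED = "path/to/russian-adapted-base"          # hypothetical vocab-extended base
INSTRUCT = "meta-llama/Meta-Llama-3-8B-Instruct"  # target instruction-tuned model

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
adapted = AutoModelForCausalLM.from_pretrained(ADAPTED, torch_dtype=torch.bfloat16)
instruct = AutoModelForCausalLM.from_pretrained(INSTRUCT, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(ADAPTED)  # extended tokenizer

with torch.no_grad():
    # Offset that instruction tuning applied to the original vocabulary rows.
    delta = (instruct.get_input_embeddings().weight
             - base.get_input_embeddings().weight)
    n_shared = delta.shape[0]  # assumes new tokens are appended after the old ones

    # Grow the instruct model's embedding (and output) matrices to the new size.
    instruct.resize_token_embeddings(len(tokenizer))
    new_emb = adapted.get_input_embeddings().weight
    target = instruct.get_input_embeddings().weight

    # Shared rows: adapted embedding plus the instruction-tuning offset.
    target[:n_shared] = new_emb[:n_shared] + delta
    # New Russian tokens: take the learned embeddings as-is.
    target[n_shared:] = new_emb[n_shared:]
    # The output projection (lm_head) would need analogous treatment; omitted here.

instruct.save_pretrained("llama-3-8b-instruct-ru-lep")
tokenizer.save_pretrained("llama-3-8b-instruct-ru-lep")
```

The appeal of this kind of procedure, as the abstract argues, is that the expensive instruction-tuning stage is reused rather than repeated: only the base-model embedding training touches new-language data, which keeps the training-data requirements small.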