MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
February 22, 2024
Authors: Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra
cs.AI
Abstract
This paper addresses the growing need for efficient large language models
(LLMs) on mobile devices, driven by increasing cloud costs and latency
concerns. We focus on designing top-quality LLMs with fewer than a billion
parameters, a practical choice for mobile deployment. Contrary to prevailing
belief emphasizing the pivotal role of data and parameter quantity in
determining model quality, our investigation underscores the significance of
model architecture for sub-billion scale LLMs. Leveraging deep and thin
architectures, coupled with embedding sharing and grouped-query attention
mechanisms, we establish a strong baseline network denoted as MobileLLM, which
attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M
state-of-the-art models. Additionally, we propose an immediate block-wise
weight sharing approach with no increase in model size and only marginal
latency overhead. The resultant models, denoted as MobileLLM-LS, demonstrate a
further accuracy enhancement of 0.7%/0.8% over MobileLLM 125M/350M. Moreover,
the MobileLLM model family shows significant improvements over previous
sub-billion models on chat benchmarks, and demonstrates correctness close to
that of LLaMA-v2 7B in API calling tasks, highlighting the capability of small
models for common on-device use cases.
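The grouped-query attention mechanism mentioned in the abstract reduces key/value parameters and cache size by letting several query heads share one key/value head. The following is a minimal numpy sketch of the general idea only, not the paper's implementation; all function and variable names are illustrative.

```python
import numpy as np

def grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads):
    """Grouped-query attention sketch: n_heads query heads share
    n_kv_heads key/value heads (requires n_heads % n_kv_heads == 0)."""
    seq, d = x.shape
    hd = d // n_heads                          # per-head dimension
    q = (x @ wq).reshape(seq, n_heads, hd)     # full set of query heads
    k = (x @ wk).reshape(seq, n_kv_heads, hd)  # fewer key heads
    v = (x @ wv).reshape(seq, n_kv_heads, hd)  # fewer value heads
    group = n_heads // n_kv_heads
    # Replicate each key/value head across its group of query heads.
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(hd)
    w = np.exp(scores - scores.max(-1, keepdims=True))  # stable softmax
    w = w / w.sum(-1, keepdims=True)
    out = np.einsum('hqk,khd->qhd', w, v)
    return out.reshape(seq, d)
```

With 4 query heads and 2 key/value heads, the key/value projections are half the size of standard multi-head attention while the output shape is unchanged.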
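The "immediate block-wise weight sharing" described above executes each transformer block more than once in succession, so the effective depth grows without increasing the parameter count, and the shared weights stay resident in cache between the repeated executions. Below is a minimal sketch of that repetition pattern under simplified assumptions (a toy residual feed-forward block stands in for a full transformer block; names are illustrative, not MobileLLM's code).

```python
import numpy as np

def ffn_block(x, w1, w2):
    # Toy residual feed-forward block with ReLU activation.
    return x + np.maximum(x @ w1, 0.0) @ w2

def forward_with_sharing(x, blocks, repeats=2):
    """Run each stored block `repeats` times back-to-back.
    Effective depth = len(blocks) * repeats, while the parameter
    count is only that of len(blocks) blocks."""
    for w1, w2 in blocks:
        for _ in range(repeats):  # immediate re-use of the same weights
            x = ffn_block(x, w1, w2)
    return x
```

Repeating a block immediately (rather than interleaving shared blocks across the network) is what keeps the latency overhead small: the weights just loaded for one execution are reused straight away.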