MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
February 22, 2024
作者: Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra
cs.AI
Abstract
This paper addresses the growing need for efficient large language models
(LLMs) on mobile devices, driven by increasing cloud costs and latency
concerns. We focus on designing top-quality LLMs with fewer than a billion
parameters, a practical choice for mobile deployment. Contrary to prevailing
belief emphasizing the pivotal role of data and parameter quantity in
determining model quality, our investigation underscores the significance of
model architecture for sub-billion scale LLMs. Leveraging deep and thin
architectures, coupled with embedding sharing and grouped-query attention
mechanisms, we establish a strong baseline network denoted as MobileLLM, which
attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M
state-of-the-art models. Additionally, we propose an immediate block-wise
weight sharing approach with no increase in model size and only marginal
latency overhead. The resultant models, denoted as MobileLLM-LS, demonstrate
a further accuracy enhancement of 0.7%/0.8% over MobileLLM 125M/350M.
Moreover, the MobileLLM model family shows significant improvements over
previous sub-billion models on chat benchmarks, and achieves correctness
close to that of LLaMA-v2 7B in API calling tasks, highlighting the
capability of small models for common on-device use cases.
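
Two of the architectural choices named in the abstract, grouped-query attention and embedding sharing, can be made concrete with a short sketch. The PyTorch code below is an illustration under assumptions, not the paper's released implementation: the dimensions (`dim=576`, 9 query heads, 3 key/value heads) and all class names are hypothetical.

```python
# Minimal sketch of grouped-query attention (GQA) plus input/output
# embedding sharing. Illustrative only; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, dim=576, n_heads=9, n_kv_heads=3):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        # Fewer K/V heads than query heads: each K/V head serves a whole
        # group of query heads, shrinking the KV projection weights.
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each K/V head so it covers its group of query heads.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

class TinyLM(nn.Module):
    """Toy model (no norms/residuals) showing embedding sharing."""
    def __init__(self, vocab=32000, dim=576):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.attn = GroupedQueryAttention(dim)
        self.lm_head = nn.Linear(dim, vocab, bias=False)
        # Embedding sharing: reuse the input embedding matrix as the
        # output projection, removing a vocab-by-dim parameter block.
        self.lm_head.weight = self.tok_emb.weight

    def forward(self, idx):
        return self.lm_head(self.attn(self.tok_emb(idx)))
```

In a sub-billion model, the vocab-by-dim embedding matrix is a large fraction of the total parameter budget, which is why sharing it between input and output is attractive at this scale.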
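The "immediate block-wise weight sharing" of MobileLLM-LS can likewise be sketched: a block's weights are reused by executing the block again immediately, so compute depth grows while the parameter count (and model size) stays fixed, and because the reused weights are still resident in cache, the latency overhead is marginal. The code below is an assumed reading of that idea, not the released implementation; `block_fn`, `n_blocks`, and `repeats` are illustrative names.

```python
# Minimal sketch of immediate block-wise weight sharing. Illustrative
# only; `Block` here is any standard transformer block.
import torch.nn as nn

class SharedStack(nn.Module):
    def __init__(self, block_fn, n_blocks=15, repeats=2):
        super().__init__()
        # n_blocks unique parameter sets, each applied `repeats` times.
        self.blocks = nn.ModuleList(block_fn() for _ in range(n_blocks))
        self.repeats = repeats

    def forward(self, x):
        for block in self.blocks:
            # Immediate reuse: run the same block again right away,
            # while its weights are still hot in cache, before moving
            # on to the next block's weights.
            for _ in range(self.repeats):
                x = block(x)
        return x

# Example: 15 unique blocks executed twice each gives 30 layers of
# compute at the storage cost of 15.
stack = SharedStack(
    lambda: nn.TransformerEncoderLayer(576, 9, batch_first=True),
    n_blocks=15,
)
```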