
MeKi: Memory-based Expert Knowledge Injection for Efficient LLM Scaling

February 3, 2026
Authors: Ning Ding, Fangcheng Liu, Kyungrae Kim, Linji Hao, Kyeng-Hun Lee, Hyeonmok Ko, Yehui Tang
cs.AI

Abstract

Scaling Large Language Models (LLMs) typically relies on increasing the number of parameters or the amount of test-time computation to boost performance. However, these strategies are impractical for edge deployment due to limited RAM and NPU resources. Despite these hardware constraints, deploying performant LLMs on edge devices such as smartphones remains crucial for user experience. To address this, we propose MeKi (Memory-based Expert Knowledge Injection), a novel system that scales LLM capacity via storage space rather than FLOPs. MeKi equips each Transformer layer with token-level memory experts that inject pre-stored semantic knowledge into the generation process. To bridge the gap between training capacity and inference efficiency, we employ a re-parameterization strategy that folds the parameter matrices used during training into a compact static lookup table. By offloading this knowledge to ROM, MeKi decouples model capacity from computational cost, introducing zero inference-latency overhead. Extensive experiments demonstrate that MeKi significantly outperforms dense LLM baselines at identical inference speed, validating the effectiveness of the memory-based scaling paradigm for on-device LLMs. Project homepage: https://github.com/ningding-o/MeKi.
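The core idea — training a per-layer, per-token expert and then folding its matrices into a static lookup table for inference — can be illustrated with a minimal NumPy sketch. All names, shapes, and the low-rank parameterization below are assumptions for illustration, not the paper's actual architecture or API:

```python
import numpy as np

VOCAB, HIDDEN, RANK = 1000, 64, 8
rng = np.random.default_rng(0)

# Hypothetical training-time expert for one layer: a learned per-token
# code combined with a shared projection (low-rank form is an assumption).
token_codes = rng.normal(size=(VOCAB, RANK))   # learned per-token codes
W = rng.normal(size=(RANK, HIDDEN))            # shared projection matrix

def expert_train(token_ids):
    """Training path: compute the expert output on the fly (costs FLOPs)."""
    return token_codes[token_ids] @ W

# Re-parameterization: precompute the expert output for every vocabulary
# token once, folding both matrices into one static table (stored in ROM).
lookup_table = token_codes @ W                 # shape (VOCAB, HIDDEN)

def expert_infer(token_ids):
    """Inference path: a pure memory read, no matrix multiply."""
    return lookup_table[token_ids]

# The injected knowledge is identical either way; only the cost differs.
ids = np.array([3, 42, 7])
hidden = rng.normal(size=(3, HIDDEN))
assert np.allclose(expert_train(ids), expert_infer(ids))
out = hidden + expert_infer(ids)               # inject into the hidden states
```

This illustrates why the scheme adds capacity without FLOPs: the table grows with vocabulary size and layer count (storage), while the per-token inference cost is a single indexed read plus an addition.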
PDF · March 21, 2026