

MeKi: Memory-based Expert Knowledge Injection for Efficient LLM Scaling

February 3, 2026
作者: Ning Ding, Fangcheng Liu, Kyungrae Kim, Linji Hao, Kyeng-Hun Lee, Hyeonmok Ko, Yehui Tang
cs.AI

Abstract

Scaling Large Language Models (LLMs) typically relies on increasing the number of parameters or the test-time computation to boost performance. However, these strategies are impractical for edge deployment due to limited RAM and NPU resources. Despite these hardware constraints, deploying performant LLMs on edge devices such as smartphones remains crucial for user experience. To address this, we propose MeKi (Memory-based Expert Knowledge Injection), a novel system that scales LLM capacity via storage space rather than FLOPs. MeKi equips each Transformer layer with token-level memory experts that inject pre-stored semantic knowledge into the generation process. To bridge the gap between training capacity and inference efficiency, we employ a re-parameterization strategy that folds the parameter matrices used during training into a compact static lookup table. By offloading this knowledge to ROM, MeKi decouples model capacity from computational cost, incurring zero inference-latency overhead. Extensive experiments demonstrate that MeKi significantly outperforms dense LLM baselines at identical inference speed, validating the effectiveness of the memory-based scaling paradigm for on-device LLMs. The project homepage is at https://github.com/ningding-o/MeKi.
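The abstract's re-parameterization idea can be illustrated with a minimal sketch. The names, shapes, and the specific factorization below are assumptions for illustration, not the paper's actual architecture: a trainable per-token memory matrix `E` and a projection `W` are folded after training into one static table `T = E @ W`, so the inference path is a pure row lookup with no extra matrix multiply.

```python
import numpy as np

vocab_size, expert_dim, hidden_dim = 100, 16, 32
rng = np.random.default_rng(0)

# Hypothetical training-time parameters of one token-level memory expert.
E = rng.standard_normal((vocab_size, expert_dim))   # per-token memory slots
W = rng.standard_normal((expert_dim, hidden_dim))   # projection into the layer

def expert_train(token_id):
    # Training-time path: look up the slot, then project (a matmul per token).
    return E[token_id] @ W

# Fold once after training; the resulting table is static and can live
# in read-only storage, off the compute path.
T = E @ W   # shape (vocab_size, hidden_dim)

def expert_infer(token_id):
    # Inference-time path: a pure lookup, zero additional FLOPs.
    return T[token_id]

# The two paths produce identical outputs for every token.
assert np.allclose(expert_train(7), expert_infer(7))
```

Under this reading, capacity is bought with storage (the table grows with the vocabulary and the number of layers) while per-token inference cost is unchanged, which matches the claimed decoupling of capacity from compute.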
PDF · March 21, 2026