MeKi: Memory-based Expert Knowledge Injection for Efficient LLM Scaling

Abstract

Scaling Large Language Models (LLMs) typically relies on increasing the number of parameters or the amount of test-time computation to boost performance. However, these strategies are impractical for edge device deployment because RAM and NPU resources are limited. Despite these hardware constraints, deploying performant LLMs on edge devices such as smartphones remains crucial for user experience. To address this, we propose MeKi (Memory-based Expert Knowledge Injection), a novel system that scales LLM capacity via storage space rather than FLOPs. MeKi equips each Transformer layer with token-level memory experts that inject pre-stored semantic knowledge into the generation process. To bridge the gap between training capacity and inference efficiency, we employ a re-parameterization strategy that folds the parameter matrices used during training into a compact static lookup table. By offloading the knowledge to ROM, MeKi decouples model capacity from computational cost, introducing zero inference latency overhead. Extensive experiments demonstrate that MeKi significantly outperforms dense LLM baselines with identical inference speed, validating the effectiveness of the memory-based scaling paradigm for on-device LLMs. The project homepage is at https://github.com/ningding-o/MeKi.
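
The re-parameterization idea in the abstract can be illustrated with a toy sketch. The abstract does not specify the training-time module, so everything below is an assumption made for illustration: a per-token embedding followed by a single linear projection, the names `embedding`, `projection`, `expert_train`, `expert_infer`, and `inject`, the toy sizes, and the additive injection into the hidden state. It is not MeKi's actual implementation; it only shows why a token-indexed computation can be folded offline into a static table that is read, not computed, at inference time.

```python
import random

random.seed(0)
VOCAB_SIZE, D_MEM, D_MODEL = 8, 4, 3  # hypothetical toy sizes

def rand_matrix(rows, cols):
    """Build a rows x cols matrix of random floats as nested lists."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def vec_mat(vec, mat):
    """Multiply a row vector of length k by a k x n matrix."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

# Assumed training-time parameters: a per-token memory embedding and a projection.
embedding = rand_matrix(VOCAB_SIZE, D_MEM)
projection = rand_matrix(D_MEM, D_MODEL)

def expert_train(token_id):
    """Training-time path: embedding lookup followed by a matrix product."""
    return vec_mat(embedding[token_id], projection)

# Re-parameterization: the output depends only on the token id, so the
# product can be computed once offline for every token and stored.
lookup_table = [vec_mat(row, projection) for row in embedding]

def expert_infer(token_id):
    """Inference-time path: a single table read, no matrix product."""
    return lookup_table[token_id]

def inject(hidden, token_id):
    """Assumed injection rule: add the looked-up vector to the hidden state."""
    return [h + m for h, m in zip(hidden, expert_infer(token_id))]
```

In this sketch the table has `VOCAB_SIZE` rows of width `D_MODEL`, so its size grows with storage while the per-token inference cost stays a single read plus an addition.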
