NGM: A Plug-and-Play, Training-Free Memory Module for Large Language Models

Abstract

Recent studies introduce conditional memory modules that decouple knowledge storage from neural computation, enabling more direct knowledge access. Compared with mixture-of-experts (MoE) models, which rely on dynamic computation paths, explicit lookup provides a more efficient knowledge retrieval mechanism. However, these approaches still depend on learned memory embeddings, which require additional training and limit flexibility. To address this, we propose N-gram Memory (NGM), a training-free, plug-and-play module composed of a Causal N-Gram Encoder and a Cosine-Gated Memory Injector. The Causal N-Gram Encoder directly averages the backbone model's pretrained token embeddings to construct N-gram representations, eliminating the need to train separate N-gram embeddings from scratch. This design requires neither an additional memory table nor a retrieval pipeline. The Cosine-Gated Memory Injector then uses a non-parametric cosine gate with ReLU to modulate the retrieved embeddings into the contextual representations. We evaluate NGM on the Qwen3 series, from 0.6B to 14B parameters, across eight benchmarks. NGM improves average performance by 0.5 to 1.2 points, with particularly clear gains on code generation and knowledge-intensive tasks (e.g., +3.0 on LiveCodeBench and +3.03 on GPQA for Qwen3-14B). NGM also improves performance on multimodal benchmarks (e.g., +1.53 on MMStar for Qwen3-VL-2B).
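
To make the two components described above concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not give the exact formulas, so several details are assumptions: the function names `causal_ngram_mean` and `cosine_gated_inject` are hypothetical, early positions average over however many tokens are available, the gate is taken to be ReLU of the cosine similarity between each hidden state and its N-gram representation, and the gated memory is added residually to the hidden state.

```python
import math


def causal_ngram_mean(token_emb, n):
    """Average each token's embedding with up to n-1 preceding tokens.

    token_emb: list of T vectors (lists of floats), e.g. rows of the
    backbone's pretrained embedding table. Only positions <= t are used
    for row t, so the encoding is causal.
    ASSUMPTION: shorter windows at the start are averaged as-is.
    """
    out = []
    for t in range(len(token_emb)):
        window = token_emb[max(0, t - n + 1): t + 1]
        dim = len(window[0])
        out.append([sum(vec[j] for vec in window) / len(window)
                    for j in range(dim)])
    return out


def cosine_gated_inject(hidden, memory, eps=1e-8):
    """Inject memory vectors into hidden states through a ReLU cosine gate.

    ASSUMPTION: gate_t = max(cos(hidden_t, memory_t), 0) and the result is
    hidden_t + gate_t * memory_t. The gate has no learned parameters.
    """
    out = []
    for h, m in zip(hidden, memory):
        dot = sum(a * b for a, b in zip(h, m))
        norm = math.sqrt(sum(a * a for a in h)) * math.sqrt(sum(b * b for b in m))
        gate = max(dot / (norm + eps), 0.0)  # ReLU suppresses opposed memory
        out.append([a + gate * b for a, b in zip(h, m)])
    return out


# Toy usage: three 2-dimensional "token embeddings" and bigram (n=2) memory.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
memory = causal_ngram_mean(embeddings, n=2)
mixed = cosine_gated_inject(embeddings, memory)
```

Under these assumptions, a memory vector that points away from the hidden state (negative cosine) receives a gate of zero and leaves the hidden state unchanged, which is one way a non-parametric gate could filter out unhelpful N-gram context without any training.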