NGM: A Plug-and-Play, Training-Free Memory Module for Large Language Models

Abstract

Recent studies introduce conditional memory modules that decouple knowledge storage from neural computation, enabling more direct access to knowledge. Compared with MoE, which relies on dynamic computation paths, explicit lookup provides a more efficient knowledge retrieval mechanism. However, these approaches still depend on learned memory embeddings, which require additional training and limit flexibility. To address this, we propose N-gram Memory (NGM), a training-free, plug-and-play module composed of a Causal N-Gram Encoder and a Cosine-Gated Memory Injector. The Causal N-Gram Encoder constructs N-gram representations by directly averaging the backbone model's pretrained token embeddings, eliminating the need to train separate N-gram embeddings from scratch. This design requires neither an additional memory table nor a retrieval pipeline. The Cosine-Gated Memory Injector then applies a non-parametric cosine gate with ReLU to the resulting N-gram embeddings before injecting them into the contextual representations. We evaluate NGM on the Qwen3 series, from 0.6B to 14B parameters, across eight benchmarks. NGM improves average performance by 0.5 to 1.2 points, with particularly clear gains on code generation and knowledge-intensive tasks (e.g., +3.0 on LiveCodeBench and +3.03 on GPQA for Qwen3-14B). NGM also improves performance on multimodal benchmarks (e.g., +1.53 on MMStar for Qwen3-VL-2B).
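
To make the two components concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the exact formulas, so the sketch assumes that the N-gram representation at position t is the mean of the embeddings of the last n tokens ending at t, that the gate is ReLU(cos(h_t, m_t)), and that injection is the residual update h_t + gate * m_t. The function names (`causal_ngram_mean`, `cosine`, `cosine_gated_inject`) and the toy vectors are hypothetical.

```python
import math


def causal_ngram_mean(token_embs, n):
    """Assumed encoder: at each position t, average the last n token embeddings ending at t."""
    out = []
    for t in range(len(token_embs)):
        # The window never extends past position t, so the encoding stays causal.
        window = token_embs[max(0, t - n + 1): t + 1]
        dim = len(window[0])
        out.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return out


def cosine(u, v, eps=1e-8):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def cosine_gated_inject(hidden, memory):
    """Assumed injector: h + ReLU(cos(h, m)) * m, with no learned parameters."""
    out = []
    for h, m in zip(hidden, memory):
        gate = max(0.0, cosine(h, m))  # ReLU closes the gate for negatively aligned memory
        out.append([h_d + gate * m_d for h_d, m_d in zip(h, m)])
    return out


# Toy data (hypothetical): three tokens with 2-dimensional embeddings.
token_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0]]

memory = causal_ngram_mean(token_embs, n=2)
injected = cosine_gated_inject(hidden, memory)
```

In this toy run the second position has a hidden state that points away from its N-gram vector, so the ReLU sets the gate to zero and the hidden state passes through unchanged. Both steps use only the existing token embeddings and hidden states, which is what makes a module of this kind training-free.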