MemoryLLM: Plug-n-Play Interpretable Feed-Forward Memory for Transformers
January 30, 2026
Authors: Ajay Jaiswal, Lauren Hannah, Han-Byul Kim, Duc Hoang, Arnav Kundu, Mehrdad Farajtabar, Minsik Cho
cs.AI
Abstract
Understanding how transformer components operate in LLMs is important, as it is at the core of recent technological advances in artificial intelligence. In this work, we revisit the challenges associated with the interpretability of feed-forward network (FFN) modules and propose MemoryLLM, which decouples FFNs from self-attention and enables us to study the decoupled FFNs as context-free, token-wise neural retrieval memory. Specifically, we investigate how input tokens access memory locations within FFN parameters and how important FFN memory is across different downstream tasks. MemoryLLM achieves context-free FFNs by training them in isolation from self-attention, directly on the token embeddings. This approach allows FFNs to be pre-computed as token-wise lookups (ToLs), enabling on-demand transfer between VRAM and storage and additionally enhancing inference efficiency. We also introduce Flex-MemoryLLM, positioned between a conventional transformer design and MemoryLLM; this architecture bridges the performance gap caused by training FFNs with context-free token-wise embeddings.
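To make the central idea concrete, the following is a minimal PyTorch sketch of how an FFN that sees only a context-free token embedding could be pre-computed into a token-wise lookup table and then replaced by a simple index at inference time. The two-layer GELU FFN, the toy dimensions, and the names ContextFreeFFN and precompute_token_lookup are illustrative assumptions, not the paper's exact architecture or training recipe.

```python
# Sketch: a context-free FFN whose output depends only on the token identity,
# so it can be evaluated once per vocabulary entry and cached as a lookup table.
import torch
import torch.nn as nn


class ContextFreeFFN(nn.Module):
    """A standard two-layer FFN trained directly on token embeddings (assumed form)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))


@torch.no_grad()
def precompute_token_lookup(ffn: ContextFreeFFN, embedding: nn.Embedding) -> torch.Tensor:
    """Evaluate the FFN once for every vocabulary item, producing a
    [vocab_size, d_model] table (a token-wise lookup, ToL) that can sit in
    storage and be paged into VRAM on demand."""
    token_ids = torch.arange(embedding.num_embeddings)
    return ffn(embedding(token_ids))


# Usage: at inference, the FFN call for each token reduces to a table index.
vocab_size, d_model, d_ff = 1000, 64, 256      # toy sizes for illustration
emb = nn.Embedding(vocab_size, d_model)
ffn = ContextFreeFFN(d_model, d_ff)
table = precompute_token_lookup(ffn, emb)      # shape: [vocab_size, d_model]

token_ids = torch.tensor([3, 141, 592])
ffn_out = table[token_ids]                     # replaces ffn(emb(token_ids))
```

Because the table depends only on token identity, it can be computed offline after training; the memory/efficiency trade-off then comes from where the table is stored and how it is streamed, rather than from recomputing the FFN per forward pass.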