MemoryLLM: Plug-n-Play Interpretable Feed-Forward Memory for Transformers

Abstract

Understanding how transformer components operate in large language models (LLMs) is important, since these components are at the core of recent technological advances in artificial intelligence. In this work, we revisit the challenges associated with the interpretability of feed-forward modules (FFNs) and propose MemoryLLM, which aims to decouple FFNs from self-attention and enables us to study the decoupled FFNs as a context-free, token-wise neural retrieval memory. In detail, we investigate how input tokens access memory locations within FFN parameters and how important FFN memory is across different downstream tasks. MemoryLLM achieves context-free FFNs by training them in isolation from self-attention, directly on the token embeddings. This approach allows FFNs to be pre-computed as token-wise lookups (ToLs), which enables on-demand transfer between VRAM and storage and also improves inference efficiency. We also introduce Flex-MemoryLLM, positioned between a conventional transformer design and MemoryLLM. This architecture bridges the performance gap caused by training FFNs with context-free token-wise embeddings.
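The pre-computation idea can be illustrated with a minimal sketch. Because a context-free FFN sees only the token embedding, its output depends on the token identity alone, so it can be evaluated once per vocabulary entry and stored as a table. The names `ffn`, `precompute_tol`, and `tol_table`, the ReLU activation, the two-matrix FFN shape, and the toy sizes are all assumptions made for illustration; they are not taken from the MemoryLLM implementation.

```python
import numpy as np

def ffn(x, w_in, w_out):
    """Two-layer feed-forward block. ReLU is an assumed activation."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def precompute_tol(embeddings, w_in, w_out):
    """Evaluate the FFN once per vocabulary entry.

    Row i of the result is the FFN output for token id i, so inference
    can replace the FFN computation with a row lookup.
    """
    return ffn(embeddings, w_in, w_out)

# Toy, hypothetical sizes and random weights standing in for trained ones.
rng = np.random.default_rng(0)
vocab_size, d_model, d_hidden = 50, 8, 32
embeddings = rng.standard_normal((vocab_size, d_model))
w_in = rng.standard_normal((d_model, d_hidden))
w_out = rng.standard_normal((d_hidden, d_model))

# Offline step: build the token-wise lookup table.
tol_table = precompute_tol(embeddings, w_in, w_out)

# Inference step: fetch FFN outputs by token id instead of recomputing them.
token_ids = np.array([3, 17, 3, 42])
looked_up = tol_table[token_ids]

# Reference path: run the FFN on the embeddings of the same tokens.
recomputed = ffn(embeddings[token_ids], w_in, w_out)
```

In this sketch the table has one row per vocabulary entry, so it could in principle be kept in storage and only the rows for the current tokens moved to accelerator memory, which is the on-demand transfer the abstract describes.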
