MemoryLLM: Plug-n-Play Interpretable Feed-Forward Memory for Transformers

Abstract

Understanding how transformer components operate in large language models (LLMs) is important because these components are at the core of recent advances in artificial intelligence. In this work, we revisit the challenges of interpreting feed-forward modules (FFNs) and propose MemoryLLM, which aims to decouple FFNs from self-attention so that the decoupled FFNs can be studied as context-free, token-wise neural retrieval memory. Specifically, we investigate how input tokens access memory locations within FFN parameters and how important FFN memory is across different downstream tasks. MemoryLLM achieves context-free FFNs by training them in isolation from self-attention, directly on the token embeddings. Because the FFN input then depends only on the token, FFN outputs can be pre-computed as token-wise lookups (ToLs), which enables on-demand transfer between VRAM and storage and further improves inference efficiency. We also introduce Flex-MemoryLLM, an architecture positioned between a conventional transformer design and MemoryLLM. It bridges the performance gap caused by training FFNs with context-free token-wise embeddings.
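
The pre-computation idea can be illustrated with a small sketch. The code below is not the authors' implementation: the toy sizes, the two-layer ReLU form of the FFN, the random weights, and every name in it (`ffn`, `tol`, `ffn_via_lookup`) are assumptions made for illustration. It only shows why an FFN that reads the raw token embedding, rather than a context-dependent hidden state, can be replaced by a table indexed by token id.

```python
import random

# Toy sizes, chosen only for illustration (assumptions, not from the paper).
VOCAB_SIZE, D_MODEL, D_HIDDEN = 8, 4, 16

rng = random.Random(0)


def rand_matrix(rows, cols):
    """Return a rows x cols matrix of random floats as a list of rows."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical parameters: a token embedding table and one two-layer FFN.
embeddings = rand_matrix(VOCAB_SIZE, D_MODEL)
w_in = rand_matrix(D_MODEL, D_HIDDEN)
w_out = rand_matrix(D_HIDDEN, D_MODEL)


def matvec(vec, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(cols)]


def ffn(x):
    """Context-free FFN: the output depends only on the token embedding x."""
    hidden = [max(0.0, h) for h in matvec(x, w_in)]  # ReLU is an assumption
    return matvec(hidden, w_out)


# Because the FFN input is the token embedding itself, the output can be
# computed once per vocabulary entry and stored as a token-wise lookup.
tol = [ffn(embeddings[token_id]) for token_id in range(VOCAB_SIZE)]


def ffn_via_lookup(token_ids):
    """Replace FFN computation at inference time with a table read."""
    return [tol[t] for t in token_ids]


token_ids = [3, 1, 3, 7]
direct = [ffn(embeddings[t]) for t in token_ids]
looked_up = ffn_via_lookup(token_ids)
```

In this sketch the table has one row per vocabulary entry, so repeated tokens read the same row regardless of their position or neighbors. A table of this kind could in principle be kept in slower storage and fetched row by row, which is the sort of on-demand transfer the abstract describes.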
