MemoryLLM: Plug-n-Play Interpretable Feed-Forward Memory for Transformers

Abstract

Understanding how transformer components operate in large language models (LLMs) is important because these components are at the core of recent advances in artificial intelligence. In this work, we revisit the challenges of interpreting feed-forward modules (FFNs) and propose MemoryLLM, which aims to decouple FFNs from self-attention so that the decoupled FFNs can be studied as context-free, token-wise neural retrieval memory. Specifically, we investigate how input tokens access memory locations within FFN parameters and how important FFN memory is across different downstream tasks. MemoryLLM achieves context-free FFNs by training them in isolation from self-attention, directly on the token embeddings. This approach allows FFNs to be pre-computed as token-wise lookups (ToLs), enabling on-demand transfer between VRAM and storage and further improving inference efficiency. We also introduce Flex-MemoryLLM, an architecture positioned between a conventional transformer design and MemoryLLM that bridges the performance gap caused by training FFNs with context-free token-wise embeddings.
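The token-wise lookup idea can be illustrated with a toy sketch. It is not the authors' implementation: the two-layer ReLU FFN, the random weights, the tiny sizes, and the helper names `make_ffn` and `precompute_lookup` are all assumptions chosen for illustration. The point it shows is only that an FFN whose input is a token embedding (rather than a context-dependent hidden state) can be evaluated once per vocabulary entry and then replaced by a table lookup.

```python
import random


def make_ffn(d_model, d_hidden, seed=0):
    """Build a toy two-layer ReLU FFN with random weights (illustrative only)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]

    def ffn(x):
        # One hidden activation per row of w_in (a "memory location" in this toy view).
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
        # Project the hidden activations back to the model dimension.
        return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

    return ffn


def precompute_lookup(ffn, embeddings):
    """Evaluate the FFN once per vocabulary entry and cache the outputs."""
    return [ffn(e) for e in embeddings]


# Hypothetical toy sizes; real models are far larger.
vocab_size, d_model, d_hidden = 8, 4, 16
rng = random.Random(1)
embeddings = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(vocab_size)]

ffn = make_ffn(d_model, d_hidden)
table = precompute_lookup(ffn, embeddings)

# At inference time, the FFN contribution for each token is a table row.
token_ids = [3, 1, 3, 7]
from_table = [table[t] for t in token_ids]
from_ffn = [ffn(embeddings[t]) for t in token_ids]
```

Because the FFN input depends only on the token identity, `from_table` and `from_ffn` agree, and repeated tokens (here token 3) map to the same row regardless of their position. In this sketch the table holds `vocab_size * d_model` values per FFN, which is the kind of structure that could be kept in storage and fetched on demand.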
