UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning

Abstract

While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with far fewer memory accesses, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameter budgets, with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models up to 2.5B activated parameters from 120B total parameters, and establish that activation density has a greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.
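For readers unfamiliar with memory layers, the following is a minimal sketch of a generic sparse key-value lookup, the basic idea that memory-layer architectures build on. It is not UltraMemV2's actual design: the function name `memory_layer_lookup`, the dot-product scoring, and the softmax weighting are all illustrative assumptions, and none of the five improvements listed in the abstract are modeled here.

```python
import math


def memory_layer_lookup(query, keys, values, top_k):
    """Hypothetical toy memory layer: weighted sum of the top_k best-matching values.

    query:  list of floats, length d
    keys:   list of N key vectors, each of length d
    values: list of N value vectors, each of length v
    Returns (output vector of length v, list of selected slot indices).
    """
    # Score every memory slot by the dot product of its key with the query.
    # (This toy version scores all keys; only the value reads are sparse.)
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]

    # Keep the indices of the top_k highest-scoring slots.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]

    # Softmax over the selected scores only (subtract the max for stability).
    max_score = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - max_score) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Read only the selected value rows and combine them.
    dim = len(values[0])
    out = [0.0] * dim
    for w, i in zip(weights, top):
        for d in range(dim):
            out[d] += w * values[i][d]
    return out, top


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
    output, selected = memory_layer_lookup([2.0, 0.0], keys, values, top_k=2)
    print(selected, output)
```

In this sketch only `top_k` value rows are read per query, no matter how many slots the memory holds. That is the intuition behind the low memory access cost the abstract attributes to memory layers.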
