UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning
August 26, 2025
Authors: Zihao Huang, Yu Bao, Qiyang Min, Siyan Chen, Ran Guo, Hongzhi Huang, Defa Zhu, Yutao Zeng, Banggu Wu, Xun Zhou, Siyuan Qiao
cs.AI
Abstract
While Mixture of Experts (MoE) models achieve remarkable efficiency by
activating only subsets of parameters, they suffer from high memory access
costs during inference. Memory-layer architectures offer an appealing
alternative with very low memory access, but previous attempts like UltraMem
have only matched the performance of 2-expert MoE models, falling significantly
short of state-of-the-art 8-expert configurations. We present UltraMemV2, a
redesigned memory-layer architecture that closes this performance gap. Our
approach introduces five key improvements: integrating memory layers into every
transformer block, simplifying value expansion with single linear projections,
adopting FFN-based value processing from PEER, implementing principled
parameter initialization, and rebalancing memory-to-FFN computation ratios.
Through extensive evaluation, we demonstrate that UltraMemV2 achieves
performance parity with 8-expert MoE models under the same computation and
parameter budgets while requiring significantly less memory access. Notably, UltraMemV2 shows
superior performance on memory-intensive tasks, with improvements of +1.6
points on long-context memorization, +6.2 points on multi-round memorization,
and +7.9 points on in-context learning. We validate our approach at scale with
models with up to 2.5B activated parameters out of 120B total parameters, and
establish that activation density has a greater impact on performance than total
sparse parameter count. Our work brings memory-layer architectures to
performance parity with state-of-the-art MoE models, presenting a compelling
alternative for efficient sparse computation.
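
The five design changes listed in the abstract describe the architecture only at a high level. As a concrete illustration of the kind of component involved, below is a minimal PyTorch sketch of a product-key memory layer with PEER-style (rank-1 FFN) value processing. The class name MemoryLayerSketch, the dimensions, and the scoring and normalization choices are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of a product-key memory layer with PEER-style (rank-1 FFN)
# values. All names, sizes, and normalization choices here are assumptions for
# illustration; they are not UltraMemV2's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryLayerSketch(nn.Module):
    def __init__(self, d_model: int, n_keys: int = 256, topk: int = 8):
        super().__init__()
        assert d_model % 2 == 0, "query is split into two halves for product keys"
        self.d_half = d_model // 2
        self.topk = topk
        # Two small key tables define n_keys * n_keys virtual memory slots.
        self.keys_a = nn.Parameter(torch.randn(n_keys, self.d_half) * self.d_half ** -0.5)
        self.keys_b = nn.Parameter(torch.randn(n_keys, self.d_half) * self.d_half ** -0.5)
        n_values = n_keys * n_keys
        # PEER-style values: each slot is a tiny rank-1 FFN (down vector + up vector).
        self.value_down = nn.Embedding(n_values, d_model)
        self.value_up = nn.Embedding(n_values, d_model)
        self.query_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model); a real layer would handle (batch, seq, d_model).
        q = self.query_proj(x)
        qa, qb = q[:, : self.d_half], q[:, self.d_half :]
        sa = qa @ self.keys_a.t()                         # (batch, n_keys)
        sb = qb @ self.keys_b.t()                         # (batch, n_keys)
        ta, ia = sa.topk(self.topk, dim=-1)               # shortlist per sub-key table
        tb, ib = sb.topk(self.topk, dim=-1)
        # Combine the two shortlists into topk*topk candidate slots, then re-rank.
        scores = (ta.unsqueeze(-1) + tb.unsqueeze(-2)).flatten(1)                  # (batch, k*k)
        slot_ids = (ia.unsqueeze(-1) * self.keys_b.shape[0] + ib.unsqueeze(-2)).flatten(1)
        best, pos = scores.topk(self.topk, dim=-1)
        chosen = slot_ids.gather(1, pos)                  # (batch, topk) slot indices
        w = F.softmax(best, dim=-1)                       # mixing weights
        down = self.value_down(chosen)                    # (batch, topk, d_model)
        up = self.value_up(chosen)                        # (batch, topk, d_model)
        # Rank-1 FFN per retrieved slot: up_i * gelu(down_i . x), weighted and summed.
        h = F.gelu(torch.einsum("bd,bkd->bk", x, down))
        return torch.einsum("bk,bk,bkd->bd", w, h, up)
```

For example, `MemoryLayerSketch(d_model=1024)(torch.randn(4, 1024))` returns a (4, 1024) tensor. The efficiency property the abstract appeals to is visible here: only `topk` value slots out of `n_keys * n_keys` are read per token, so memory access stays low even as the value table grows.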