UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning

August 26, 2025
Authors: Zihao Huang, Yu Bao, Qiyang Min, Siyan Chen, Ran Guo, Hongzhi Huang, Defa Zhu, Yutao Zeng, Banggu Wu, Xun Zhou, Siyuan Qiao
cs.AI

Abstract

While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very little memory access, but previous attempts such as UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameter budgets, but with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models of up to 2.5B activated parameters out of 120B total parameters, and establish that activation density has a greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.
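
To make the abstract's architectural ingredients concrete, the sketch below shows a minimal product-key memory layer whose retrieved values act as tiny FFNs, in the spirit of PEER-style value processing, with a single linear query projection. This is an illustrative sketch only: the class and parameter names (ProductKeyMemory, n_keys, top_k, value_down, value_up), the hyperparameters, the retrieval scheme, and the initialization are assumptions for illustration and are not taken from the paper.

```python
# Minimal, illustrative sketch of a product-key memory layer with FFN-style
# ("PEER-like") values. Hyperparameters, the retrieval scheme, and the exact
# value parameterization are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProductKeyMemory(nn.Module):
    def __init__(self, d_model: int, n_keys: int = 256, top_k: int = 16, d_key: int = 128):
        super().__init__()
        # Product keys: two sub-key tables of size n_keys address n_keys**2 slots.
        self.n_keys, self.top_k = n_keys, top_k
        self.query_proj = nn.Linear(d_model, d_key)
        self.sub_keys = nn.Parameter(torch.randn(2, n_keys, d_key // 2) * d_key ** -0.5)
        # FFN-style values: each slot holds a tiny rank-1 FFN (down and up vectors)
        # instead of a single output embedding.
        n_slots = n_keys * n_keys
        self.value_down = nn.Embedding(n_slots, d_model)
        self.value_up = nn.Embedding(n_slots, d_model)
        nn.init.normal_(self.value_down.weight, std=d_model ** -0.5)
        nn.init.normal_(self.value_up.weight, std=d_model ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q = self.query_proj(x)                                   # (B, T, d_key)
        q1, q2 = q.chunk(2, dim=-1)                              # two half-queries
        s1 = q1 @ self.sub_keys[0].t()                           # (B, T, n_keys)
        s2 = q2 @ self.sub_keys[1].t()                           # (B, T, n_keys)
        # Top-k per half, combine into k*k candidate slots, then re-rank.
        v1, i1 = s1.topk(self.top_k, dim=-1)
        v2, i2 = s2.topk(self.top_k, dim=-1)
        cand_scores = (v1.unsqueeze(-1) + v2.unsqueeze(-2)).flatten(-2)        # (B, T, k*k)
        cand_ids = (i1.unsqueeze(-1) * self.n_keys + i2.unsqueeze(-2)).flatten(-2)
        scores, idx = cand_scores.topk(self.top_k, dim=-1)
        slot_ids = cand_ids.gather(-1, idx)                      # (B, T, k)
        gate = F.softmax(scores, dim=-1)                         # (B, T, k)
        # FFN-style value processing: sum_k gate_k * up_k * act(x . down_k)
        down = self.value_down(slot_ids)                         # (B, T, k, d_model)
        up = self.value_up(slot_ids)                             # (B, T, k, d_model)
        hidden = F.gelu((x.unsqueeze(-2) * down).sum(-1))        # (B, T, k)
        return ((gate * hidden).unsqueeze(-1) * up).sum(-2)      # (B, T, d_model)


# Usage: the output is added to the residual stream of a transformer block.
mem = ProductKeyMemory(d_model=512)
y = mem(torch.randn(2, 16, 512))   # -> shape (2, 16, 512)
```

In a transformer block, such a module would sit alongside (or replace part of) the dense FFN, with its output added to the residual stream; the paper's actual memory-to-FFN computation ratio and initialization scheme are not reproduced here.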