Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models
November 28, 2025
Authors: Xiang Hu, Zhanchao Zhou, Ruiqi Liang, Zehuan Li, Wei Wu, Jianguo Li
cs.AI
Abstract
This work explores the challenge of building "Machines that Can Remember", framing long-term memory as the problem of efficient ultra-long-context modeling. We argue that this requires three key properties: sparsity, random-access flexibility, and length generalization. To address ultra-long-context modeling, we leverage Hierarchical Sparse Attention (HSA), a novel attention mechanism that satisfies all three properties. We integrate HSA into Transformers to build HSA-UltraLong, an 8B-parameter MoE model trained on over 8 trillion tokens and rigorously evaluated on tasks with in-domain and out-of-domain context lengths to demonstrate its capability to handle ultra-long contexts. Results show that our model performs comparably to full-attention baselines on in-domain lengths while achieving over 90% accuracy on most in-context retrieval tasks with contexts up to 16M tokens. This report outlines our experimental insights and open problems, contributing a foundation for future research in ultra-long context modeling.
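To make the idea of hierarchical sparse attention concrete, the sketch below illustrates a generic two-level scheme: queries first score chunk-level summaries of a long context, then attend only within the few top-scoring chunks, so compute scales with the number of retrieved chunks rather than the full sequence length. This is an assumption-based illustration only; the function name, mean-pooled chunk summaries, chunk size, and top-k selection rule are ours and are not the specific HSA mechanism defined in the paper.

```python
# Schematic two-level sparse attention: chunk-level retrieval followed by
# token-level attention inside the retrieved chunks. Illustrative sketch only,
# not the paper's HSA implementation.
import torch
import torch.nn.functional as F


def hierarchical_sparse_attention(q, k, v, chunk_size=64, top_k=4):
    """
    q: (num_queries, d)   queries
    k, v: (seq_len, d)    keys / values of the (long) context
    Returns: (num_queries, d) attention outputs.
    """
    seq_len, d = k.shape
    num_chunks = seq_len // chunk_size  # leftover tokens are dropped for simplicity

    # Level 1: summarize each chunk (here: mean-pooled keys) and score chunks per query.
    k_chunks = k[: num_chunks * chunk_size].view(num_chunks, chunk_size, d)
    v_chunks = v[: num_chunks * chunk_size].view(num_chunks, chunk_size, d)
    chunk_summary = k_chunks.mean(dim=1)                    # (num_chunks, d)
    chunk_scores = q @ chunk_summary.T / d ** 0.5           # (num_queries, num_chunks)
    top_idx = chunk_scores.topk(min(top_k, num_chunks), dim=-1).indices

    # Level 2: ordinary softmax attention restricted to the retrieved chunks.
    out = torch.empty_like(q)
    for i in range(q.shape[0]):
        sel_k = k_chunks[top_idx[i]].reshape(-1, d)         # (top_k*chunk_size, d)
        sel_v = v_chunks[top_idx[i]].reshape(-1, d)
        attn = F.softmax(q[i] @ sel_k.T / d ** 0.5, dim=-1)
        out[i] = attn @ sel_v
    return out


if __name__ == "__main__":
    torch.manual_seed(0)
    q = torch.randn(8, 128)      # 8 queries, head dimension 128
    k = torch.randn(4096, 128)   # a 4K-token context
    v = torch.randn(4096, 128)
    print(hierarchical_sparse_attention(q, k, v).shape)  # torch.Size([8, 128])
```

Because each query touches only top_k * chunk_size tokens, this kind of scheme gives the sparsity and random-access flexibility the abstract highlights; how length generalization is achieved is specific to the paper's HSA design and is not captured by this sketch.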