Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models
November 28, 2025
Authors: Xiang Hu, Zhanchao Zhou, Ruiqi Liang, Zehuan Li, Wei Wu, Jianguo Li
cs.AI
Abstract
This work explores the challenge of building "Machines that Can Remember", framing long-term memory as the problem of efficient ultra-long-context modeling. We argue that this requires three key properties: sparsity, random-access flexibility, and length generalization. To address ultra-long-context modeling, we leverage Hierarchical Sparse Attention (HSA), a novel attention mechanism that satisfies all three properties. We integrate HSA into Transformers to build HSA-UltraLong, an 8B-parameter MoE model trained on over 8 trillion tokens and rigorously evaluated on tasks with in-domain and out-of-domain context lengths to demonstrate its ability to handle ultra-long contexts. Results show that our model performs comparably to full-attention baselines on in-domain lengths while achieving over 90% accuracy on most in-context retrieval tasks with contexts of up to 16M tokens. This report outlines our experimental insights and open problems, contributing a foundation for future research in ultra-long context modeling.
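To make the three properties concrete, the sketch below illustrates one generic way chunk-level sparse retrieval can provide sparsity and random access: rank fixed-size chunks of the key/value cache by a cheap summary score, then attend only inside the top-k chunks. This is a minimal, hypothetical illustration under assumed choices (mean-pooled chunk summaries, fixed chunk_size and top_k), not the HSA mechanism described in the paper.

```python
# Hypothetical sketch of chunk-level top-k sparse attention.
# NOT the paper's HSA implementation; chunk_size, top_k, and the
# mean-pooled chunk summaries are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_retrieval_attention(q, K, V, chunk_size=64, top_k=4):
    """Attend a single query q over (K, V) by first ranking fixed-size
    chunks, then attending only to tokens in the top-k chunks.

    q: (d,)   K, V: (n, d)   returns: (d,)
    """
    n, d = K.shape
    n_chunks = (n + chunk_size - 1) // chunk_size

    # Chunk-level summaries (here: mean of the chunk's keys) give a
    # cheap relevance score per chunk.
    scores = np.empty(n_chunks)
    for c in range(n_chunks):
        chunk_keys = K[c * chunk_size:(c + 1) * chunk_size]
        scores[c] = q @ chunk_keys.mean(axis=0)

    # Random access: any chunk can be selected, regardless of position.
    selected = np.argsort(scores)[-top_k:]

    # Sparsity: token-level attention touches only top_k * chunk_size
    # tokens, so per-query cost stays bounded as the context grows.
    idx = np.concatenate([
        np.arange(c * chunk_size, min((c + 1) * chunk_size, n))
        for c in selected
    ])
    w = softmax(q @ K[idx].T / np.sqrt(d))
    return w @ V[idx]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 32, 4096
    q = rng.normal(size=d)
    K = rng.normal(size=(n, d))
    V = rng.normal(size=(n, d))
    print(sparse_retrieval_attention(q, K, V).shape)  # (32,)
```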