

Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths

January 10, 2026
作者: Xuezhe Ma, Shicheng Wen, Linghao Jin, Bilge Acun, Ruihang Lai, Bohan Hou, Will Lin, Hao Zhang, Songlin Yang, Ryan Lee, Mengxi Wu, Jonathan May, Luke Zettlemoyer, Carole-Jean Wu
cs.AI

Abstract

Designing a unified neural network that efficiently and inherently processes sequential data of arbitrary length is a central and challenging problem in sequence modeling. The design choices in the Transformer architecture, including its quadratic complexity and weak length extrapolation, limit its ability to scale to long sequences. In this work, we propose Gecko, a neural architecture that inherits the design of Mega and Megalodon (exponential moving average with gated attention) and further introduces several technical components to improve its capability to capture long-range dependencies, including timestep decay normalization, a sliding chunk attention mechanism, and adaptive working memory. In a controlled pretraining comparison with Llama2 and Megalodon at the scale of 7 billion parameters and 2 trillion training tokens, Gecko achieves better efficiency and long-context scalability. Gecko reaches a training loss of 1.68, significantly outperforming Llama2-7B (1.75) and Megalodon-7B (1.70), and landing close to Llama2-13B (1.67). Notably, without relying on any context-extension techniques, Gecko exhibits inherent long-context processing and retrieval capabilities, stably handling sequences of up to 4 million tokens and retrieving information from contexts up to 4× longer than its attention window. Code: https://github.com/XuezheMax/gecko-llm
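For context, the Mega/Megalodon line of work that Gecko builds on mixes token representations with a damped exponential moving average before gated attention. The sketch below illustrates only that basic damped-EMA recurrence; the multi-dimensional expansion, gating, and the Gecko-specific components named in the abstract (timestep decay normalization, sliding chunk attention, adaptive working memory) are not specified here, so the module name, parameterization, and shapes are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class DampedEMA(nn.Module):
    """Minimal damped exponential moving average over the time dimension.

    Hypothetical sketch of the EMA-style token mixing used in Mega/Megalodon,
    which Gecko inherits; the parameterization is a simplification.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Unconstrained parameters mapped to (0, 1) with a sigmoid.
        self.alpha_logit = nn.Parameter(torch.zeros(dim))  # update rate
        self.delta_logit = nn.Parameter(torch.zeros(dim))  # damping factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        alpha = torch.sigmoid(self.alpha_logit)            # (dim,)
        delta = torch.sigmoid(self.delta_logit)            # (dim,)
        decay = 1.0 - alpha * delta                        # per-dimension retention
        state = torch.zeros(x.size(0), x.size(-1), device=x.device, dtype=x.dtype)
        outputs = []
        for t in range(x.size(1)):
            # y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}
            state = alpha * x[:, t] + decay * state
            outputs.append(state)
        return torch.stack(outputs, dim=1)                 # (batch, seq_len, dim)


if __name__ == "__main__":
    ema = DampedEMA(dim=16)
    y = ema(torch.randn(2, 8, 16))
    print(y.shape)  # torch.Size([2, 8, 16])
```

The sequential loop is written for clarity; in practice this recurrence is typically computed in parallel (e.g., via a convolutional or scan formulation), which is what makes EMA-based mixing attractive for very long sequences.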