Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths

Abstract

Designing a unified neural network to efficiently and inherently process sequential data with arbitrary lengths is a central and challenging problem in sequence modeling. The design choices in the Transformer, including quadratic complexity and weak length extrapolation, have limited its ability to scale to long sequences. In this work, we propose Gecko, a neural architecture that inherits the design of Mega and Megalodon (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability to capture long-range dependencies, including timestep decay normalization, a sliding chunk attention mechanism, and adaptive working memory. In a controlled pretraining comparison with Llama2 and Megalodon at the scale of 7 billion parameters and 2 trillion training tokens, Gecko achieves better efficiency and long-context scalability. Gecko reaches a training loss of 1.68, significantly outperforming Llama2-7B (1.75) and Megalodon-7B (1.70), and landing close to Llama2-13B (1.67). Notably, without relying on any context-extension techniques, Gecko exhibits inherent long-context processing and retrieval capabilities, stably handling sequences of up to 4 million tokens and retrieving information from contexts up to 4× longer than its attention window. Code: https://github.com/XuezheMax/gecko-llm
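
To illustrate why an exponential-moving-average component fits sequences of arbitrary length, the sketch below runs a simplified scalar damped EMA of the form y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}, which is the basic recurrence behind the EMA in Mega. It is not Gecko's implementation: the function name `damped_ema`, the scalar parameters `alpha` and `delta`, and the `state` argument are assumptions made for illustration, and none of the other components named in the abstract (timestep decay normalization, sliding chunk attention, adaptive working memory) are modeled here.

```python
def damped_ema(xs, alpha, delta, state=0.0):
    """Simplified scalar damped EMA (illustrative only, not Gecko's code).

    Computes y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1} for each x_t.
    `state` carries y_{t-1} across calls, so a long sequence can be
    processed piece by piece with constant memory.
    """
    ys = []
    for x_t in xs:
        state = alpha * x_t + (1.0 - alpha * delta) * state
        ys.append(state)
    return ys, state


# Process a sequence in one pass ...
full, _ = damped_ema([1.0, 0.0, 0.0], alpha=0.5, delta=1.0)

# ... or as a stream of pieces, carrying only the scalar state forward.
head, carried = damped_ema([1.0], alpha=0.5, delta=1.0)
tail, _ = damped_ema([0.0, 0.0], alpha=0.5, delta=1.0, state=carried)

print(full)         # [0.5, 0.25, 0.125]
print(head + tail)  # [0.5, 0.25, 0.125]
```

Because the recurrence depends only on the previous output, the per-step cost and memory do not grow with sequence length, in contrast to full attention, whose cost grows quadratically.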
