

StateX: Enhancing RNN Recall via Post-training State Expansion

September 26, 2025
Authors: Xingyu Shen, Yingfa Chen, Zhen Leng Thai, Xu Han, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

While Transformer-based models have demonstrated remarkable language modeling performance, their high complexity results in high costs when processing long contexts. In contrast, recurrent neural networks (RNNs) such as linear attention and state space models have gained popularity due to their constant per-token complexity. However, these recurrent models struggle with tasks that require accurate recall of contextual information from long contexts, because all contextual information is compressed into a constant-size recurrent state. Previous works have shown that recall ability is positively correlated with the recurrent state size, yet directly training RNNs with larger recurrent states results in high training costs. In this paper, we introduce StateX, a training pipeline for efficiently expanding the states of pre-trained RNNs through post-training. For two popular classes of RNNs, linear attention and state space models, we design post-training architectural modifications to scale up the state size with no or negligible increase in model parameters. Experiments on models up to 1.3B parameters demonstrate that StateX efficiently enhances the recall and in-context learning ability of RNNs without incurring high post-training costs or compromising other capabilities.
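To make the recall bottleneck concrete, below is a minimal sketch (not the authors' implementation) of the recurrent state in unnormalized linear attention and of one possible way a pre-trained model's state could be widened after training. The dimensions, the expansion factor `r`, and the `expand` projection are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: the recurrent state of linear attention is a fixed-size
# matrix, so its capacity bounds recall; enlarging the key dimension enlarges
# the state while adding only a small projection. Illustrative assumptions only.
import torch

def linear_attention_step(S, k, v, q):
    """One recurrent step of (unnormalized) linear attention.

    S: (d_k, d_v) recurrent state -- the only memory of the entire context.
    k, q: (d_k,) key/query features; v: (d_v,) value vector.
    """
    S = S + torch.outer(k, v)      # write: compress the current token into the state
    out = q @ S                    # read: query the compressed context
    return S, out

d_k, d_v = 64, 64                  # the state holds d_k * d_v numbers per head
S = torch.zeros(d_k, d_v)

# Hypothetical post-training state expansion: widen the key dimension so the
# state grows from (d_k, d_v) to (r * d_k, d_v). A small projection maps the
# pre-trained keys/queries into the wider space; only this projection is new.
r = 4
expand = torch.nn.Linear(d_k, r * d_k, bias=False)   # negligible extra parameters
S_big = torch.zeros(r * d_k, d_v)

k, v, q = torch.randn(d_k), torch.randn(d_v), torch.randn(d_k)
S, out = linear_attention_step(S, k, v, q)
S_big, out_big = linear_attention_step(S_big, expand(k), v, expand(q))
print(out.shape, out_big.shape)    # both (d_v,); only the state capacity grew
```

The per-token cost stays constant in both cases; what changes is how much context the state can hold before overwriting earlier information, which is the quantity the paper links to recall ability.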