Reinforced Fast Weights with Next-Sequence Prediction

February 18, 2026
Authors: Hee Seung Hwang, Xindi Wu, Sanghyuk Chun, Olga Russakovsky
cs.AI

Abstract

Fast weight architectures offer a promising alternative to attention-based transformers for long-context modeling by maintaining constant memory overhead regardless of context length. However, their potential is limited by the next-token prediction (NTP) training paradigm. NTP optimizes single-token predictions and ignores semantic coherence across multiple tokens following a prefix. Consequently, fast weight models, which dynamically update their parameters to store contextual information, learn suboptimal representations that fail to capture long-range dependencies. We introduce REFINE (Reinforced Fast weIghts with Next sEquence prediction), a reinforcement learning framework that trains fast weight models under the next-sequence prediction (NSP) objective. REFINE selects informative token positions based on prediction entropy, generates multi-token rollouts, assigns self-supervised sequence-level rewards, and optimizes the model with group relative policy optimization (GRPO). REFINE is applicable throughout the training lifecycle of pre-trained language models: mid-training, post-training, and test-time training. Our experiments on LaCT-760M and DeltaNet-1.3B demonstrate that REFINE consistently outperforms supervised fine-tuning with NTP across needle-in-a-haystack retrieval, long-context question answering, and diverse tasks in LongBench. REFINE provides an effective and versatile framework for improving long-context modeling in fast weight architectures.
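The pipeline the abstract describes, selecting informative token positions by prediction entropy and normalizing sequence-level rewards group-relatively in the GRPO style, can be sketched in plain Python. The function names and data shapes below are illustrative assumptions, not the paper's implementation:

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_positions(dists, k):
    """Pick the k positions whose next-token predictions have the
    highest entropy; a hypothetical stand-in for REFINE's
    informative-position selection step."""
    ranked = sorted(range(len(dists)),
                    key=lambda i: token_entropy(dists[i]),
                    reverse=True)
    return sorted(ranked[:k])

def group_relative_advantages(rewards):
    """GRPO-style advantages: each rollout's sequence-level reward
    normalized by the mean and standard deviation of its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Example: three positions with increasingly uncertain predictions;
# the two most uncertain ones are chosen as rollout anchors.
dists = [[1.0], [0.5, 0.5], [0.25, 0.25, 0.25, 0.25]]
anchors = select_positions(dists, k=2)   # → [1, 2]

# Four rollouts from one anchor, scored by a self-supervised reward.
advs = group_relative_advantages([0.2, 0.8, 0.2, 0.8])
```

In an actual training loop, the advantages would weight the policy-gradient update of the fast weight model at each selected position; that part is omitted here.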