FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
May 11, 2026
Authors: Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
cs.AI
Abstract
Large language models can now process increasingly long inputs, yet their ability to effectively use information spread across long contexts remains limited. We trace this gap to how the attention budget is spent during supervised fine-tuning (SFT) on long sequences: positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than semantically relevant content. This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal, limiting the model's ability to learn robust long-context capabilities. We introduce FocuSFT, a bilevel optimization framework that addresses this problem at training time. An inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and the outer loop performs SFT conditioned on this sharpened representation. Both loops apply bidirectional attention over context tokens while preserving causal masking for responses, reducing the causal asymmetry that gives rise to attention sinks and aligning inner- and outer-loop behavior. On BABILong, FocuSFT improves accuracy by up to +14 pp across 4K–32K context lengths; on RULER, it raises CWE aggregation from 72.9% to 81.1% at 16K; and on GPQA with agentic tool use, it yields a 24% relative gain in pass@1. Attention analysis shows that FocuSFT reduces attention sink mass by 529× and triples context engagement during training. Code: https://github.com/JarvisPei/FocuSFT
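The following is a minimal, illustrative PyTorch sketch of the bilevel training step described above. The helper names (build_hybrid_mask, bilevel_step, mask_token_id), the toy model interface model(input_ids, attn_mask) -> logits, the masked-reconstruction inner objective, and the first-order treatment of the inner updates are assumptions made for illustration only; they are not taken from the FocuSFT code.

import torch
import torch.nn.functional as F

def build_hybrid_mask(ctx_len: int, resp_len: int, device="cpu") -> torch.Tensor:
    # Boolean (T, T) mask, True = "may attend": bidirectional over the context
    # block, causal over the response block, response attends to all context.
    T = ctx_len + resp_len
    mask = torch.zeros(T, T, dtype=torch.bool, device=device)
    mask[:ctx_len, :ctx_len] = True                       # context <-> context (bidirectional)
    mask[ctx_len:, :ctx_len] = True                       # response -> context
    mask[ctx_len:, ctx_len:] = torch.tril(                # response -> earlier response (causal)
        torch.ones(resp_len, resp_len, dtype=torch.bool, device=device))
    return mask

def bilevel_step(model, fast_params, slow_optimizer, ctx_ids, resp_ids,
                 mask_token_id, inner_lr=1e-4, inner_steps=1):
    # One outer-loop update. `fast_params` are the lightweight fast-weight
    # tensors (e.g., low-rank adapters) inside `model`; only they move in the
    # inner loop, while `slow_optimizer` updates the base model in the outer loop.
    B, Lc = ctx_ids.shape
    Lr = resp_ids.shape[1]
    full_mask = build_hybrid_mask(Lc, Lr, device=ctx_ids.device)

    # Inner loop: adapt the fast weights on the training context so that
    # attention concentrates on its content. The masked-reconstruction
    # objective below is a placeholder, not the paper's actual inner loss.
    inner_opt = torch.optim.SGD(fast_params, lr=inner_lr)
    for _ in range(inner_steps):
        corrupted = ctx_ids.clone()
        positions = torch.rand(B, Lc, device=ctx_ids.device) < 0.15
        if not positions.any():
            continue
        corrupted[positions] = mask_token_id
        logits = model(corrupted, full_mask[:Lc, :Lc])     # bidirectional context attention
        inner_loss = F.cross_entropy(logits[positions], ctx_ids[positions])
        inner_opt.zero_grad()
        inner_loss.backward()
        inner_opt.step()

    # Outer loop: standard SFT loss on the response, conditioned on the
    # adapted (sharpened) context representation. First-order only: we do not
    # backpropagate through the inner-loop updates.
    input_ids = torch.cat([ctx_ids, resp_ids], dim=1)
    logits = model(input_ids, full_mask)
    resp_logits = logits[:, Lc - 1:-1]                     # positions predicting response tokens
    outer_loss = F.cross_entropy(resp_logits.reshape(-1, resp_logits.size(-1)),
                                 resp_ids.reshape(-1))
    slow_optimizer.zero_grad()
    outer_loss.backward()
    slow_optimizer.step()
    return outer_loss.item()

The hybrid mask mirrors the abstract's description: context tokens attend to one another bidirectionally while response tokens remain causal, which is the mechanism the abstract credits with reducing attention sinks and keeping inner- and outer-loop behavior aligned.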