FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
May 11, 2026
Authors: Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
cs.AI
Abstract
Large language models can now process increasingly long inputs, yet their ability to use information spread across long contexts effectively remains limited. We trace this gap to how the attention budget is spent during supervised fine-tuning (SFT) on long sequences: positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than to semantically relevant content. This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal and limits the model's ability to learn robust long-context capabilities. We introduce FocuSFT, a bilevel optimization framework that addresses this problem at training time. The inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and the outer loop performs SFT conditioned on this sharpened representation. Both loops apply bidirectional attention over context tokens while preserving causal masking for responses, reducing the causal asymmetry that gives rise to attention sinks and aligning inner- and outer-loop behavior. On BABILong, FocuSFT improves accuracy by up to +14 percentage points across 4K–32K context lengths; on RULER, it raises the CWE aggregation score at 16K from 72.9% to 81.1%; and on GPQA with agentic tool use, it yields a 24% relative gain in pass@1. Attention analysis shows that FocuSFT reduces attention sink mass by 529× and triples context engagement during training. Code: https://github.com/JarvisPei/FocuSFT
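To make the described mechanics concrete, the following is a minimal, hypothetical PyTorch sketch of one training step as the abstract presents it: an inner loop that adapts fast-weight parameters on the context, an outer SFT loss conditioned on the adapted weights, and a hybrid mask that is bidirectional over context tokens and causal over response tokens. The helper and method names (hybrid_attention_mask, model.context_loss, model.sft_loss) and the inner_steps / inner_lr hyperparameters are illustrative assumptions, not the paper's released implementation.

```python
import torch

def hybrid_attention_mask(context_len: int, response_len: int) -> torch.Tensor:
    """Boolean mask (True = may attend): bidirectional within the context,
    causal over the response; response tokens also see the full context."""
    total = context_len + response_len
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:context_len, :context_len] = True                          # context <-> context (bidirectional)
    causal = torch.tril(torch.ones(response_len, response_len)).bool()
    mask[context_len:, context_len:] = causal                        # response -> earlier response (causal)
    mask[context_len:, :context_len] = True                          # response -> full context
    return mask

def focusft_step(model, fast_params, context_ids, response_ids,
                 inner_steps: int = 3, inner_lr: float = 1e-3):
    """One bilevel update: the inner loop adapts fast weights on the context,
    the outer loop computes the SFT loss conditioned on the adapted weights.
    `model.context_loss` and `model.sft_loss` are assumed interfaces."""
    mask = hybrid_attention_mask(context_ids.size(1), response_ids.size(1))

    # Inner loop: adapt lightweight fast weights on the training context,
    # sketched here as gradient steps on an auxiliary context loss (the
    # abstract does not specify the exact inner objective).
    adapted = [p.clone().requires_grad_(True) for p in fast_params]
    for _ in range(inner_steps):
        inner_loss = model.context_loss(context_ids, fast_params=adapted, attn_mask=mask)
        grads = torch.autograd.grad(inner_loss, adapted, create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(adapted, grads)]

    # Outer loop: standard SFT loss on the response, conditioned on the
    # adapted parametric memory; create_graph=True above lets gradients
    # flow through the inner updates back to the base parameters.
    outer_loss = model.sft_loss(context_ids, response_ids,
                                fast_params=adapted, attn_mask=mask)
    return outer_loss
```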