FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

Abstract

Large language models can now process increasingly long inputs, yet their ability to use information spread across long contexts remains limited. We trace this gap to how the attention budget is spent during supervised fine-tuning (SFT) on long sequences: positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than to semantically relevant content. This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal and limits the model's ability to learn robust long-context capabilities. We introduce FocuSFT, a bilevel optimization framework that addresses this problem at training time. An inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and an outer loop performs SFT conditioned on this sharpened representation. Both loops apply bidirectional attention over context tokens while preserving causal masking for responses, which reduces the causal asymmetry that gives rise to attention sinks and aligns the behavior of the inner and outer loops. On BABILong, FocuSFT improves accuracy by up to 14 percentage points across context lengths from 4K to 32K; on RULER, it raises CWE aggregation from 72.9% to 81.1% at 16K; and on GPQA with agentic tool use, it yields a 24% relative gain in pass@1. Attention analysis shows that FocuSFT reduces attention sink mass by 529× and triples context engagement during training. Code: https://github.com/JarvisPei/FocuSFT
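
The masking scheme described above (bidirectional attention over context tokens, causal masking for responses) can be illustrated with a small sketch. The abstract does not say how the mask is implemented, so this is only one plausible reading: it assumes the context occupies the first positions of the sequence and the response follows, and the function name build_mixed_attention_mask is hypothetical rather than taken from the FocuSFT code.

```python
def build_mixed_attention_mask(context_len: int, response_len: int) -> list[list[bool]]:
    """Return mask[i][j], True when query position i may attend to key position j.

    Assumed layout (not confirmed by the source): positions [0, context_len)
    hold the context, positions [context_len, context_len + response_len)
    hold the response.
    """
    if context_len < 0 or response_len < 0:
        raise ValueError("lengths must be non-negative")

    total = context_len + response_len
    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            if i < context_len:
                # Context query: sees every context token (bidirectional),
                # but never a response token.
                allowed = j < context_len
            else:
                # Response query: sees the whole context plus response
                # tokens up to and including itself (causal).
                allowed = j <= i
            row.append(allowed)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Three context tokens followed by two response tokens.
    for row in build_mixed_attention_mask(context_len=3, response_len=2):
        print("".join("1" if allowed else "0" for allowed in row))
    # 11100
    # 11100
    # 11100
    # 11110
    # 11111
```

Under a purely causal mask the first token is the only key visible to every query, which is one commonly cited reason attention mass collects there. In this sketch every context token is visible to every context query, so no single position is privileged in that way.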
