FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

Abstract

Large language models can now process increasingly long inputs, yet their ability to effectively use information spread across long contexts remains limited. We trace this gap to how attention budget is spent during supervised fine-tuning (SFT) on long sequences: positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than semantically relevant content. This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal, limiting the model's ability to learn robust long-context capabilities. We introduce FocuSFT, a bilevel optimization framework that addresses this problem at training time. An inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and the outer loop performs SFT conditioned on this sharpened representation. Both loops apply bidirectional attention over context tokens while preserving causal masking for responses, reducing the causal asymmetry that gives rise to attention sinks and aligning inner-outer behavior. On BABILong, FocuSFT improves accuracy by up to +14pp across context lengths from 4K to 32K; on RULER, it raises CWE aggregation from 72.9% to 81.1% at 16K; and on GPQA with agentic tool use, it yields a 24% relative gain in pass@1. Attention analysis shows that FocuSFT reduces attention sink mass by 529× and triples context engagement during training. Code: https://github.com/JarvisPei/FocuSFT
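The masking scheme described in the abstract (bidirectional attention over context tokens, causal masking over the response) can be illustrated with a small sketch. This is not taken from the FocuSFT codebase: the function name `build_prefix_mask` and its parameters are hypothetical, and the layout assumes a single sequence in which all context tokens precede all response tokens.

```python
def build_prefix_mask(context_len: int, response_len: int) -> list[list[bool]]:
    """Return mask[q][k] = True when query position q may attend to key k.

    Illustrative assumption: positions 0 .. context_len-1 are context,
    the remaining positions are the response.
    """
    total = context_len + response_len
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            if k < context_len:
                # Context keys are visible to every query, so context
                # tokens attend to each other in both directions.
                allowed = True
            else:
                # Response keys are visible only to response queries at
                # the same or a later position (causal). Context queries
                # always have q < k here, so they never see the response.
                allowed = q >= k
            row.append(allowed)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print a 3-token context followed by a 2-token response.
    for row in build_prefix_mask(3, 2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under this layout the first context token is no longer the only position visible to every query, which is one way to picture the reduced causal asymmetry the abstract refers to. In a real training setup the boolean matrix would be passed to the attention implementation in whatever mask format it expects.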
