Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Impairs Long-Range Recall and How to Fix It

Abstract

Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet and Jet-Nemotron, retrieval performance on Needle-In-A-Haystack (NIAH) deteriorates substantially after CoT-SFT, and the degradation becomes more severe under harder retrieval settings and longer context windows. For example, HypeNet-9B on NIAH-S2@256K decreases from 67.2% to 9.4%. We attribute this to CoT-SFT biasing attention gradients toward short-range patterns, disrupting query-key projections (W_Q, W_K) that are responsible for long-range routing. Motivated by this observation, we propose QK-Restore, a training-free method that restores only W_Q and W_K from the pre-SFT checkpoint while preserving all other post-SFT parameters. We further introduce a Procrustes variant to balance routing preservation and reasoning adaptation. Across architectures, QK-Restore consistently restores long-context capability at zero training cost while preserving reasoning performance; for instance, on HypeNet-5B it improves S3@256K from 65.4% to 76.4% while maintaining strong reasoning performance.
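The restore step described above can be sketched as a checkpoint merge. This is a minimal illustration, not the authors' implementation: it assumes both checkpoints are available as flat name-to-tensor mappings and that the query and key projections can be recognized by the substrings `q_proj` and `k_proj` in their parameter names. Both the function name `qk_restore` and those name patterns are hypothetical and would need adjusting to the actual architecture. The Procrustes variant is not covered here, since the abstract does not specify it.

```python
from typing import Any, Dict, Tuple

# Assumption: W_Q / W_K parameters are identifiable by these name substrings.
# The patterns are illustrative and are not taken from the paper.
QK_PATTERNS: Tuple[str, ...] = ("q_proj", "k_proj")


def is_qk_param(name: str, patterns: Tuple[str, ...] = QK_PATTERNS) -> bool:
    """Return True if the parameter name looks like a query or key projection."""
    return any(p in name for p in patterns)


def qk_restore(
    pre_sft: Dict[str, Any],
    post_sft: Dict[str, Any],
    patterns: Tuple[str, ...] = QK_PATTERNS,
) -> Dict[str, Any]:
    """Build a merged state dict without any training.

    Every parameter comes from the post-SFT checkpoint, except the query/key
    projections, which are taken from the pre-SFT checkpoint.
    """
    merged = dict(post_sft)  # shallow copy, so post_sft itself is not modified
    for name in post_sft:
        if is_qk_param(name, patterns):
            if name not in pre_sft:
                raise KeyError(f"{name} is missing from the pre-SFT checkpoint")
            merged[name] = pre_sft[name]
    return merged
```

With a framework such as PyTorch, the two inputs would typically be the state dicts of the pre-SFT and post-SFT checkpoints, and the merged result would be loaded back into the model before evaluation on long-context benchmarks.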