Attention Forgetting in Hybrid LLMs: How CoT Fine-Tuning Impairs Long-Range Recall and How to Fix It

Abstract

Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet and Jet-Nemotron, retrieval performance on Needle-In-A-Haystack (NIAH) deteriorates substantially after CoT-SFT, and the degradation becomes more severe under harder retrieval settings and longer context windows. For example, HypeNet-9B drops from 67.2% to 9.4% on NIAH-S2@256K. We attribute this to CoT-SFT biasing attention gradients toward short-range patterns, which disrupts the query-key projections (W_Q, W_K) responsible for long-range routing. Motivated by this observation, we propose QK-Restore, a training-free method that restores only W_Q and W_K from the pre-SFT checkpoint while keeping all other post-SFT parameters unchanged. We further introduce a Procrustes variant to balance routing preservation and reasoning adaptation. Across architectures, QK-Restore consistently recovers long-context capability at zero training cost while preserving reasoning performance; for instance, on HypeNet-5B it improves S3@256K from 65.4% to 76.4% while maintaining strong reasoning performance.
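The restore step described above can be illustrated with a minimal sketch. It assumes that checkpoints are available as flat name-to-tensor mappings and that query and key projections can be identified by the substrings `q_proj` and `k_proj` in the parameter name; both the naming convention and the helper names (`is_qk_param`, `qk_restore`) are hypothetical and not taken from the paper. The Procrustes variant is not covered, since the abstract does not specify it.

```python
from typing import Any, Callable, Dict

# ASSUMPTION: query/key projection weights are identifiable by these substrings.
# This naming is hypothetical; adjust it to the actual checkpoint layout.
QK_MARKERS = ("q_proj", "k_proj")


def is_qk_param(name: str) -> bool:
    """Return True if the parameter name looks like a W_Q or W_K projection."""
    return any(marker in name for marker in QK_MARKERS)


def qk_restore(
    pre_sft: Dict[str, Any],
    post_sft: Dict[str, Any],
    select: Callable[[str], bool] = is_qk_param,
) -> Dict[str, Any]:
    """Build a merged checkpoint: W_Q/W_K from pre-SFT, everything else from post-SFT.

    No training is involved; this is a pure parameter swap. The inputs are
    not modified (the returned dict is a new shallow copy).
    """
    merged = dict(post_sft)  # start from the post-SFT parameters
    for name in post_sft:
        if select(name):
            if name not in pre_sft:
                raise KeyError(f"pre-SFT checkpoint is missing parameter: {name}")
            merged[name] = pre_sft[name]  # restore the pre-SFT routing weights
    return merged


if __name__ == "__main__":
    # Toy checkpoints with placeholder values standing in for weight tensors.
    pre = {"attn.q_proj.weight": "Q_pre", "attn.k_proj.weight": "K_pre",
           "attn.v_proj.weight": "V_pre", "mlp.up.weight": "M_pre"}
    post = {"attn.q_proj.weight": "Q_post", "attn.k_proj.weight": "K_post",
            "attn.v_proj.weight": "V_post", "mlp.up.weight": "M_post"}
    for name, value in qk_restore(pre, post).items():
        print(name, "->", value)
```

With a framework such as PyTorch, the same function could in principle be applied to two state dicts before loading the merged result into the model, provided both checkpoints share identical parameter names and shapes.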