Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces

Abstract

Long chain-of-thought (CoT) traces are widely used as supervision for reasoning-oriented LLM SFT, yet answer-correct traces can still lead to markedly different fine-tuning outcomes. We study post-conclusion continuation in answer-correct long-CoT data: the part of a trace that follows a point where the answer already appears sufficiently supported, yet carries additional reasoning that remains in the supervised target. To test its training effect, we use a delete-only editor to construct answer-preserving suffix removal and compare CoT-based SFT on the original and processed traces. We observe improved SFT outcomes after removing the editor-identified post-conclusion continuation, suggesting that this continuation is harmful to training in our setting. We therefore refer to this empirically supported phenomenon as harmful continuation. Beyond this intervention, we further characterize the removed post-conclusion continuation through uncertainty and hidden-state progress. We observe persistent local uncertainty together with weakened terminal-directional progress, forming an uncertainty–geometry mismatch. Finally, we instantiate Harmful Continuation Cut (HCC), a lightweight boundary proxy that approximates the editor-identified post-conclusion continuation boundary.
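
As an illustration of what answer-preserving suffix removal means operationally, the following is a minimal sketch and not the editor or HCC procedure itself. It assumes a trace has already been split into ordered reasoning steps and that a boundary index is supplied externally (for example by an editor or a boundary proxy). The names `Trace`, `remove_suffix`, and `is_delete_only` are hypothetical and introduced here only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trace:
    """Hypothetical container: ordered reasoning steps plus the final answer."""
    steps: List[str]
    answer: str


def remove_suffix(trace: Trace, boundary: int) -> Trace:
    """Keep steps[:boundary], drop everything after it, leave the answer untouched."""
    if not 0 <= boundary <= len(trace.steps):
        raise ValueError("boundary out of range")
    return Trace(steps=list(trace.steps[:boundary]), answer=trace.answer)


def is_delete_only(original: Trace, edited: Trace) -> bool:
    """True if `edited` is a prefix of `original` with the same answer,
    i.e. nothing was rewritten or inserted, only a suffix was deleted."""
    n = len(edited.steps)
    return (
        edited.answer == original.answer
        and n <= len(original.steps)
        and edited.steps == original.steps[:n]
    )


# Toy example (assumed data): the answer is supported after step 3,
# and the remaining steps are post-conclusion continuation.
trace = Trace(
    steps=[
        "Let x = 2.",
        "Then 2x = 4.",
        "So the answer is 4.",
        "Wait, let me re-check the multiplication.",
        "2 times 2 is 4, as before.",
    ],
    answer="4",
)
edited = remove_suffix(trace, boundary=3)
```

The `is_delete_only` check makes the constraint explicit: the processed trace differs from the original only by a removed suffix, so any change in SFT outcome between the two can be attributed to the removed continuation rather than to rewritten content.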