Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces

Abstract

Long chain-of-thought (CoT) traces are widely used as supervision for reasoning-oriented LLM SFT, yet answer-correct traces can still lead to markedly different fine-tuning outcomes. We study post-conclusion continuation in answer-correct long-CoT data: a continuation where the answer appears sufficiently supported, but the trace continues with additional reasoning that remains in the supervised target. To test its training effect, we use a delete-only editor to construct answer-preserving suffix removal and compare CoT-based SFT on the original and processed traces. We observe improved SFT outcomes after removing the editor-identified post-conclusion continuation, suggesting that this continuation is harmful to training in our setting. We therefore refer to this empirically supported phenomenon as harmful continuation. Beyond this intervention, we further characterize the removed post-conclusion continuation through uncertainty and hidden-state progress. We observe persistent local uncertainty together with weakened terminal-directional progress, forming an uncertainty–geometry mismatch. Finally, we instantiate Harmful Continuation Cut (HCC), a lightweight boundary proxy that approximates the editor-identified post-conclusion continuation boundary.