Why Reasoning Models Lose Coverage: The Role of Data and Forks in the Road

Abstract

Recent progress in large language models has led to the emergence of reasoning models, which achieve strong performance on complex tasks through specialized fine-tuning procedures. While these methods reliably improve pass@1 accuracy, prior work has observed coverage shrinkage, in which pass@k degrades relative to the base model. In this paper, we investigate why this shrinkage arises under SFT-based post-training. We hypothesize that it is driven by properties of the fine-tuning data, specifically decision points, or "forks in the road": scenarios in which the model faces indecipherable patterns with multiple valid reasoning paths. To test this hypothesis, we design controlled case studies that simulate such decision-point settings, spanning indecipherable nodes in graph branching and reasoning modes. By tracking post-training dynamics in these settings, we find that the shrinkage phenomenon is tightly correlated with the prevalence of decision-point scenarios in the training data. We also demonstrate that this shrinkage can be partially mitigated through targeted data synthesis design around decision points and a more systematic diversity-encouraging decoding mechanism. Our findings identify data-centric factors as a key driver of shrinkage in reasoning models and highlight diversity-aware designs as an effective lever for controlling it.