Why Do Reasoning Models Lose Coverage? The Role of Data and Forks in the Road
May 16, 2026
Authors: Ngoc-Hieu Nguyen, Parshin Shojaee, Phuc Minh Nguyen, Nan Zhang, Chandan K Reddy, Khoa D Doan, Rui Zhang
cs.AI
Abstract
Recent progress in large language models has led to the emergence of reasoning models, which achieve strong performance on complex tasks through specialized fine-tuning procedures. While these methods reliably improve pass@1 accuracy, prior work has observed that they exhibit coverage shrinkage, where pass@k degrades relative to the base model. In this paper, we investigate why reasoning shrinkage arises under SFT-based post-training. We hypothesize that this behavior is driven by properties of the fine-tuning data, specifically decision points or "forks in the road": scenarios where the model faces indecipherable patterns with multiple valid reasoning paths. To test this hypothesis, we design controlled case studies that simulate such decision-point settings, spanning indecipherable nodes in graph branching and reasoning modes. By tracking post-training dynamics in these settings, we find that the shrinkage phenomenon is tightly correlated with the prevalence of decision-point scenarios in the training data. We also demonstrate that this shrinkage can be partially mitigated through targeted design of synthesized decision-point data and a more systematic diversity-encouraging decoding mechanism. Our findings identify data-centric factors as a key driver of shrinkage in reasoning models and highlight diversity-aware designs as an effective lever for controlling it.
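To make the pass@1 versus pass@k contrast concrete, the sketch below computes pass@k with the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c is the number of correct ones. The abstract does not say which estimator the authors use, so this choice is an assumption, and the function names and per-problem counts are hypothetical values invented purely for illustration, not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem pass@k estimate over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct samples out of n = 10 on four problems.
# "base" solves every problem occasionally; "tuned" solves two problems
# every time and the other two never.
N_SAMPLES = 10
base_counts = [2, 2, 2, 2]
tuned_counts = [10, 10, 0, 0]

for k in (1, 5):
    base = mean_pass_at_k(base_counts, N_SAMPLES, k)
    tuned = mean_pass_at_k(tuned_counts, N_SAMPLES, k)
    print(f"pass@{k}: base={base:.3f} tuned={tuned:.3f}")
```

In this toy setup the hypothetical tuned model wins at k = 1 but loses at k = 5, because it never produces a correct sample on two of the four problems. That is the shape of the coverage shrinkage described above, shown here only on made-up numbers.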