On Step Length Confounding in LLM Reasoning Data Selection

Abstract

Large reasoning models have recently demonstrated strong performance on complex tasks that require long chain-of-thought reasoning, achieved through supervised fine-tuning on large-scale, high-quality datasets. To construct such datasets, existing pipelines generate long reasoning data from more capable Large Language Models (LLMs) and then filter for high-quality samples using either manual heuristics or naturalness-based selection. Naturalness-based data selection, which ranks samples by the average log probability an LLM assigns to their tokens, has proven effective. However, our analysis shows that, when applied to LLM reasoning datasets, it systematically prefers samples with longer reasoning steps (i.e., more tokens per step) over samples of higher quality, a phenomenon we term step length confounding. Through quantitative analysis, we attribute this phenomenon to the low probability of the first token in each reasoning step: longer steps dilute the influence of these first tokens and thereby inflate the average log probability. To address this issue, we propose two methods: ASLEC-DROP, which drops first-token probabilities when computing the average log probability, and ASLEC-CASL, which applies a causal debiasing regression to remove the confounding effect of the first tokens. Experiments across four LLMs and five evaluation benchmarks demonstrate that our approach effectively mitigates the step length confounding problem.
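The dilution effect described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the representation of a sample as a list of steps (each a list of token log probabilities), and the toy log-probability values are all assumptions made for illustration. The second function follows only the one-line description of ASLEC-DROP given above (skip the first token of each step when averaging); ASLEC-CASL is not sketched because the abstract does not specify its regression.

```python
from typing import List

Steps = List[List[float]]  # assumed layout: one list of token log probs per reasoning step


def avg_logprob(steps: Steps) -> float:
    """Naturalness score: mean token log probability over all steps."""
    tokens = [lp for step in steps for lp in step]
    if not tokens:
        raise ValueError("sample contains no tokens")
    return sum(tokens) / len(tokens)


def avg_logprob_drop_first(steps: Steps) -> float:
    """Mean token log probability, skipping the first token of every step."""
    tokens = [lp for step in steps for lp in step[1:]]
    if not tokens:
        raise ValueError("no tokens left after dropping first tokens")
    return sum(tokens) / len(tokens)


# Hypothetical toy values: step-initial tokens are unlikely, all others are likely.
FIRST, REST = -4.0, -0.5

# Two samples with the same total length (24 tokens) but different step lengths.
short_steps = [[FIRST] + [REST] * 3 for _ in range(6)]   # 6 steps of 4 tokens
long_steps = [[FIRST] + [REST] * 11 for _ in range(2)]   # 2 steps of 12 tokens

print("plain mean:", avg_logprob(short_steps), avg_logprob(long_steps))
print("drop first:", avg_logprob_drop_first(short_steps),
      avg_logprob_drop_first(long_steps))
```

With these toy values, the plain mean ranks the long-step sample higher (about -0.79 versus -1.375) even though the non-initial tokens are identical in both samples. After the first tokens are dropped, both samples score -0.5, so step length no longer affects the ranking.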