Temporal Consistency for LLM Reasoning Process Error Identification
March 18, 2025
Authors: Jiacheng Guo, Yue Wu, Jiahao Qiu, Kaixuan Huang, Xinzhe Juan, Ling Yang, Mengdi Wang
cs.AI
Abstract
Verification is crucial for effective mathematical reasoning. We present a
new temporal consistency method where verifiers iteratively refine their
judgments based on the previous assessment. Unlike one-round verification or
multi-model debate approaches, our method leverages consistency in a sequence
of self-reflection actions to improve verification accuracy. Empirical
evaluations across diverse mathematical process error identification benchmarks
(Mathcheck, ProcessBench, and PRM800K) show consistent performance improvements
over baseline methods. When applied to the recent DeepSeek R1 distilled models,
our method demonstrates strong performance, enabling 7B/8B distilled models to
outperform all 70B/72B models and GPT-4o on ProcessBench. Notably, the
distilled 14B model with our method achieves performance comparable to
Deepseek-R1. Our code is available at
https://github.com/jcguo123/Temporal-Consistency
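The abstract describes the core loop of the method: a verifier re-judges the same reasoning process over several rounds, conditioning on its own previous assessments, and a verdict is accepted once the recent sequence of self-reflections is consistent. A minimal sketch of that idea is below; the function name, `window`, and `max_rounds` are hypothetical illustration, not the paper's actual API.

```python
from typing import Callable, List


def temporal_consistency_verify(
    verify_step: Callable[[List[bool]], bool],
    max_rounds: int = 10,
    window: int = 3,
) -> bool:
    """Iteratively re-verify a solution, stopping once the last
    `window` judgments agree (temporal consistency)."""
    history: List[bool] = []
    for _ in range(max_rounds):
        # Each round, the verifier sees its previous judgments and
        # produces a refined correct/incorrect verdict.
        history.append(verify_step(history))
        if len(history) >= window and len(set(history[-window:])) == 1:
            return history[-1]  # judgments have converged
    return history[-1]  # budget exhausted: fall back to latest verdict


# Illustration with a mock verifier whose judgment flips once
# and then stabilizes across subsequent self-reflection rounds.
judgments = iter([True, False, False, False])
verdict = temporal_consistency_verify(lambda hist: next(judgments))
```

The design choice, per the abstract, is that agreement across a temporal sequence of one model's self-reflections replaces one-round verification or cross-model debate as the acceptance criterion.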