Temporal Consistency for LLM Reasoning Process Error Identification
March 18, 2025
Authors: Jiacheng Guo, Yue Wu, Jiahao Qiu, Kaixuan Huang, Xinzhe Juan, Ling Yang, Mengdi Wang
cs.AI
Abstract
Verification is crucial for effective mathematical reasoning. We present a
new temporal consistency method where verifiers iteratively refine their
judgments based on the previous assessment. Unlike one-round verification or
multi-model debate approaches, our method leverages consistency in a sequence
of self-reflection actions to improve verification accuracy. Empirical
evaluations across diverse mathematical process error identification benchmarks
(Mathcheck, ProcessBench, and PRM800K) show consistent performance improvements
over baseline methods. When applied to the recent DeepSeek R1 distilled models,
our method demonstrates strong performance, enabling 7B/8B distilled models to
outperform all 70B/72B models and GPT-4o on ProcessBench. Notably, the
distilled 14B model with our method achieves performance comparable to
Deepseek-R1. Our code is available at
https://github.com/jcguo123/Temporal-Consistency
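The abstract describes the core loop of the method: a verifier re-judges the same reasoning process over several rounds, conditioning on its own previous assessments, and a verdict is accepted once the recent sequence of self-reflections is consistent. A minimal sketch of that idea is below; the function name, `window`, and `max_rounds` are hypothetical illustration, not the paper's actual API.

```python
from typing import Callable, List


def temporal_consistency_verify(
    verify_step: Callable[[List[bool]], bool],
    max_rounds: int = 10,
    window: int = 3,
) -> bool:
    """Iteratively re-verify a solution, stopping once the last
    `window` judgments agree (temporal consistency)."""
    history: List[bool] = []
    for _ in range(max_rounds):
        # Each round, the verifier sees its previous judgments and
        # produces a refined correct/incorrect verdict.
        history.append(verify_step(history))
        if len(history) >= window and len(set(history[-window:])) == 1:
            return history[-1]  # judgments have converged
    return history[-1]  # budget exhausted: fall back to latest verdict


# Illustration with a mock verifier whose judgment flips once
# and then stabilizes across subsequent self-reflection rounds.
judgments = iter([True, False, False, False])
verdict = temporal_consistency_verify(lambda hist: next(judgments))
```

The design choice, per the abstract, is that agreement across a temporal sequence of one model's self-reflections replaces one-round verification or cross-model debate as the acceptance criterion.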