Temporal Consistency for LLM Reasoning Process Error Identification

Abstract

Verification is crucial for effective mathematical reasoning. We present a new temporal consistency method in which verifiers iteratively refine their judgments based on their previous assessments. Unlike one-round verification or multi-model debate approaches, our method leverages consistency across a sequence of self-reflection actions to improve verification accuracy. Empirical evaluations on diverse mathematical process error identification benchmarks (Mathcheck, ProcessBench, and PRM800K) show consistent performance improvements over baseline methods. When applied to the recent DeepSeek R1 distilled models, our method demonstrates strong performance, enabling 7B/8B distilled models to outperform all 70B/72B models and GPT-4o on ProcessBench. Notably, the distilled 14B model with our method achieves performance comparable to DeepSeek-R1. Our code is available at https://github.com/jcguo123/Temporal-Consistency
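The iterative refinement described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository above for that). The `Verifier` signature, the majority vote, and the `patience` and `max_rounds` parameters are all assumptions introduced here for illustration: each verifier is modeled as a callable that receives its own previous judgment and returns the index of the first erroneous step, and the loop stops once the majority answer has stayed the same for `patience` consecutive rounds.

```python
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical verifier interface (an assumption, not from the paper):
# takes the verifier's previous judgment (None in the first round) and
# returns the index of the first erroneous step, or -1 for "no error".
Verifier = Callable[[Optional[int]], int]


def majority_vote(judgments: List[int]) -> int:
    """Return the most common judgment (ties go to the first one seen)."""
    return Counter(judgments).most_common(1)[0][0]


def temporally_consistent_verdict(
    verifiers: List[Verifier],
    patience: int = 2,    # assumed stopping rule: rounds the majority must hold
    max_rounds: int = 10,  # assumed cap on self-reflection rounds
) -> int:
    # Initial round: every verifier judges without a previous assessment.
    judgments = [verify(None) for verify in verifiers]
    history = [majority_vote(judgments)]

    for _ in range(max_rounds):
        # Self-reflection round: each verifier revisits its own last judgment.
        judgments = [verify(prev) for verify, prev in zip(verifiers, judgments)]
        history.append(majority_vote(judgments))

        # Stop once the majority answer is stable across recent rounds.
        if len(history) >= patience and len(set(history[-patience:])) == 1:
            break

    return history[-1]
```

In practice each verifier would wrap an LLM call that is shown the problem, the candidate solution, and the verifier's previous assessment. The stability check is what makes the consistency "temporal": agreement is required over successive rounds, not only across verifiers within one round.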
