LEMMA: Learning from Errors for MatheMatical Advancement in LLMs
March 21, 2025
Authors: Zhuoshi Pan, Yu Li, Honglin Lin, Qizhi Pei, Zinan Tang, Wei Wu, Chenlin Ming, H. Vicky Zhao, Conghui He, Lijun Wu
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable reasoning
capability in solving mathematical problems. However, existing approaches
primarily focus on improving the quality of correct training data, e.g.,
distilling high-quality correct solutions from advanced models, neglecting the
value contained in error data, potentially hindering the model's reflective
ability. Though some studies attempt to leverage error data, they often involve
complex mechanisms, such as Monte Carlo Tree Search (MCTS) to explore error
nodes. In this work, we propose to enhance LLMs' reasoning ability by Learning
from Errors for Mathematical Advancement (LEMMA). LEMMA constructs data
consisting of an incorrect solution with an erroneous step and a reflection
connection to a correct solution for fine-tuning. Specifically, we
systematically analyze the model-generated error types and introduce an
error-type grounded mistake augmentation method to collect diverse and
representative errors. Correct solutions are obtained either by fixing the
errors or by regenerating from scratch. Through a model-aware smooth reflection
connection, the erroneous solution is transitioned into the correct one. By
fine-tuning on the
constructed dataset, the model is able to self-correct errors autonomously
within the generation process without relying on external critique models.
Experimental results demonstrate that LEMMA achieves significant performance
improvements over other strong baselines.
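The abstract describes training targets that chain an erroneous solution, a reflective transition, and a correct solution. A minimal sketch of how such a fine-tuning example might be assembled is shown below; the function name, field names, and the arithmetic example are illustrative assumptions, not details from the paper.

```python
# Hedged sketch (hypothetical helper): build one LEMMA-style training example
# by concatenating an erroneous solution, a reflection connector, and the
# correct solution into a single fine-tuning completion.
def build_lemma_example(question, bad_solution, reflection, good_solution):
    """Return a prompt/completion pair whose completion embeds
    self-correction: error -> reflection -> corrected solution."""
    completion = f"{bad_solution}\n{reflection}\n{good_solution}"
    return {"prompt": question, "completion": completion}

# Illustrative instance with a planted calculation error (12 * 3 = 26).
ex = build_lemma_example(
    "What is 12 * 13?",
    "12 * 13 = 12 * 10 + 12 * 3 = 120 + 26 = 146.",  # erroneous step
    "Wait, 12 * 3 is 36, not 26. Let me redo this step.",  # reflection connector
    "12 * 13 = 12 * 10 + 12 * 3 = 120 + 36 = 156.",  # corrected solution
)
```

Fine-tuning on pairs shaped like `ex` is what lets the model emit the correction inside a single generation, rather than calling an external critique model.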