ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline
April 3, 2024
Authors: Yifan Xu, Xiao Liu, Xinghan Liu, Zhenyu Hou, Yueyan Li, Xiaohan Zhang, Zihan Wang, Aohan Zeng, Zhengxiao Du, Wenyi Zhao, Jie Tang, Yuxiao Dong
cs.AI
Abstract
Large language models (LLMs) have shown excellent mastery of human language, but still struggle in real-world applications that require mathematical problem-solving. While many strategies and datasets have been developed to enhance LLMs' mathematical abilities, it remains a challenge to simultaneously maintain and improve both language and mathematical capabilities in deployed LLM systems. In this work, we tailor the Self-Critique pipeline, which addresses this challenge in the feedback learning stage of LLM alignment. We first train a general Math-Critique model from the LLM itself to provide feedback signals. Then, we sequentially employ rejective fine-tuning and direct preference optimization over the LLM's own generations for data collection. Based on ChatGLM3-32B, we conduct a series of experiments on both academic datasets and our newly created challenging dataset, MathUserEval. Results show that our pipeline significantly enhances the LLM's mathematical problem-solving while still improving its language ability, outperforming LLMs that could be two times larger. Related techniques have been deployed to ChatGLM (https://chatglm.cn), an online serving LLM. The related evaluation dataset and scripts are released at https://github.com/THUDM/ChatGLM-Math.
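The abstract describes a data-collection loop driven by a single critique model: the LLM's own generations are scored by Math-Critique, high-scoring answers are kept for rejective fine-tuning, and best/worst answer pairs are used for direct preference optimization. The following is a minimal Python sketch of that loop under stated assumptions; the function names (generate_fn, critique_fn), the score scale, and the acceptance threshold are illustrative placeholders, not the authors' released implementation.

```python
"""Sketch of a Self-Critique-style data-collection loop (illustrative only).

Assumptions (not from the paper's code):
- generate_fn(prompt, k) samples k candidate answers from the LLM.
- critique_fn(prompt, answer) returns a Math-Critique score, higher is better.
- rft_threshold is a hypothetical cutoff for keeping answers in the RFT set.
"""

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # answer with the higher critique score
    rejected: str  # answer with the lower critique score


def collect_self_critique_data(
    prompts: List[str],
    generate_fn: Callable[[str, int], List[str]],
    critique_fn: Callable[[str, str], float],
    k: int = 8,
    rft_threshold: float = 8.0,
) -> Tuple[List[Tuple[str, str]], List[PreferencePair]]:
    """Return (RFT examples, DPO preference pairs) built from the model's own outputs."""
    rft_data: List[Tuple[str, str]] = []
    dpo_pairs: List[PreferencePair] = []

    for prompt in prompts:
        candidates = generate_fn(prompt, k)
        if not candidates:
            continue

        # Score every candidate with the critique model and sort best-first.
        scored = sorted(
            ((critique_fn(prompt, ans), ans) for ans in candidates),
            key=lambda pair: pair[0],
            reverse=True,
        )

        # Rejective fine-tuning: keep only answers the critique model rates highly.
        rft_data.extend((prompt, ans) for score, ans in scored if score >= rft_threshold)

        # DPO: pair the best-rated answer against the worst-rated one,
        # provided the critique model actually distinguishes them.
        best_score, best_ans = scored[0]
        worst_score, worst_ans = scored[-1]
        if best_score > worst_score:
            dpo_pairs.append(PreferencePair(prompt, best_ans, worst_ans))

    return rft_data, dpo_pairs
```

In practice the sampling count and threshold would be tuned; the sketch only illustrates how one critique model can drive both the rejective fine-tuning and preference-optimization stages of the pipeline.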