StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion
August 6, 2025
Authors: Yutong Wu, Di Huang, Ruosi Wan, Yue Peng, Shijie Shang, Chenrui Cao, Lei Qi, Rui Zhang, Zidong Du, Jie Yan, Xing Hu
cs.AI
Abstract
Autoformalization aims to translate natural-language mathematical statements into a formal language. While LLMs have accelerated progress in this area, existing methods still suffer from low accuracy. We identify two key abilities for effective autoformalization: comprehensive mastery of formal-language domain knowledge, and the reasoning capability to understand natural-language problems and to align informal statements with their formal counterparts. Without the former, a model cannot identify the correct formal objects; without the latter, it struggles to interpret real-world contexts and map them precisely into formal expressions. To address these gaps, we introduce ThinkingF, a data-synthesis and training pipeline that improves both abilities. First, we construct two datasets: one by distilling and selecting large-scale examples rich in formal knowledge, and another by generating informal-to-formal reasoning trajectories guided by expert-designed templates. We then apply supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) on these datasets to further fuse and refine the two abilities. The resulting 7B and 32B models exhibit both comprehensive formal knowledge and strong informal-to-formal reasoning. Notably, StepFun-Formalizer-32B achieves state-of-the-art BEq@1 scores of 40.5% on FormalMATH-Lite and 26.7% on ProverBench, surpassing all prior general-purpose and specialized models.
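
To make the task concrete, here is a minimal illustration of statement-level autoformalization. This is our own toy example, not drawn from the paper's data; we assume Lean 4 with Mathlib as the target formal language, consistent with the FormalMATH and ProverBench benchmarks. The proof body is left as `sorry` because the task produces statements, not proofs.

```lean
import Mathlib

-- Informal statement: "Show that the sum of any two even integers is even."
-- Candidate Lean 4 formalization of the statement; the proof body is
-- intentionally `sorry`, since autoformalization targets statements.
theorem sum_of_two_evens_is_even (a b : ℤ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  sorry
```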
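
For context on the metric: BEq@1 accepts a single sampled formalization only when it can be proven logically equivalent to a reference formalization in both directions. The sketch below shows the kind of bidirectional check involved, using hypothetical reference/candidate statements of our own; the benchmarks' actual checking harness may differ.

```lean
import Mathlib

-- BEq-style equivalence check (toy sketch, our own example).
-- The candidate is phrased differently from the reference and is
-- accepted only if both implications are provable.
example :
    -- candidate: the model's formalization
    (∀ a b : ℤ, Even a → Even b → Even (b + a)) ↔
    -- reference: the benchmark's ground-truth formalization
    (∀ a b : ℤ, Even a → Even b → Even (a + b)) := by
  constructor
  · intro h a b ha hb
    have := h a b ha hb
    rwa [add_comm] at this
  · intro h a b ha hb
    have := h a b ha hb
    rwa [add_comm] at this
```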