ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization
October 28, 2025
Authors: Guoxin Chen, Jing Wu, Xinjie Chen, Wayne Xin Zhao, Ruihua Song, Chengxi Li, Kai Fan, Dayiheng Liu, Minpeng Liao
cs.AI
Abstract
Autoformalization, which translates natural language mathematics into
machine-verifiable formal statements, is critical for using formal mathematical
reasoning to solve math problems stated in natural language. While Large
Language Models can generate syntactically correct formal statements, they
often fail to preserve the original problem's semantic intent. This limitation
arises because existing approaches treat autoformalization as a simple
translation task, lacking the self-reflection and iterative refinement
mechanisms that human experts naturally employ. To address these issues, we
propose ReForm, a Reflective Autoformalization method that tightly integrates
semantic consistency evaluation into the autoformalization process. This
enables the model to iteratively generate formal statements, assess their
semantic fidelity, and self-correct identified errors through progressive
refinement. To effectively train this reflective model, we introduce
Prospective Bounded Sequence Optimization (PBSO), which applies different
rewards at different sequence positions so that the model develops both
accurate autoformalization and reliable semantic validation, preventing
superficial critiques that would undermine the purpose of reflection. Extensive
experiments across four autoformalization benchmarks demonstrate that ReForm
achieves an average improvement of 17.2 percentage points over the strongest
baselines. To further ensure evaluation reliability, we introduce
ConsistencyCheck, a benchmark of 859 expert-annotated items that not only
validates LLMs as judges but also reveals that autoformalization is inherently
difficult: even human experts produce semantic errors in up to 38.5% of cases.
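The reflective loop the abstract describes (generate a formal statement, assess its semantic fidelity, self-correct) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `ToyModel`, `Critique`, and all method names are hypothetical placeholders, and a trivial string check stands in for the model's learned consistency judgment.

```python
from dataclasses import dataclass

@dataclass
class Critique:
    """Illustrative container for a semantic-consistency verdict."""
    consistent: bool
    feedback: str = ""

class ToyModel:
    """Stand-in for the LLM; ReForm would invoke its trained model here."""
    def generate_formal(self, problem):
        # Deliberately flawed first draft to exercise the refinement path.
        return "theorem draft : 1 + 1 = 3"

    def check_consistency(self, problem, stmt):
        # Toy check: the real method would judge semantic fidelity.
        ok = "= 2" in stmt
        return Critique(ok, "" if ok else "statement contradicts the problem")

    def refine(self, problem, stmt, critique):
        # Toy correction guided by the critique.
        return "theorem draft : 1 + 1 = 2"

def reflective_autoformalize(problem, model, max_rounds=3):
    """Generate, assess, and progressively refine a formal statement."""
    statement = model.generate_formal(problem)
    for _ in range(max_rounds):
        critique = model.check_consistency(problem, statement)
        if critique.consistent:
            break
        statement = model.refine(problem, statement, critique)
    return statement
```

The round budget (`max_rounds`) bounds the refinement sequence; PBSO's position-specific rewards would then score both the intermediate critiques and the final statement during training.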