

ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization

October 28, 2025
作者: Guoxin Chen, Jing Wu, Xinjie Chen, Wayne Xin Zhao, Ruihua Song, Chengxi Li, Kai Fan, Dayiheng Liu, Minpeng Liao
cs.AI

Abstract

Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for applying formal mathematical reasoning to math problems stated in natural language. While Large Language Models can generate syntactically correct formal statements, they often fail to preserve the original problem's semantic intent. This limitation arises because existing approaches treat autoformalization as a simplistic translation task, lacking the self-reflection and iterative refinement mechanisms that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess their semantic fidelity, and self-correct identified errors through progressive refinement. To effectively train this reflective model, we introduce Prospective Bounded Sequence Optimization (PBSO), which applies different rewards at different sequence positions to ensure that the model develops both accurate autoformalization and correct semantic validation, preventing superficial critiques that would undermine the purpose of reflection. Extensive experiments across four autoformalization benchmarks demonstrate that ReForm achieves an average improvement of 17.2 percentage points over the strongest baselines. To further ensure evaluation reliability, we introduce ConsistencyCheck, a benchmark of 859 expert-annotated items that not only validates LLMs as judges but also reveals that autoformalization is inherently difficult: even human experts produce semantic errors in up to 38.5% of cases.
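The generate–evaluate–refine loop described in the abstract can be sketched as follows. This is a toy illustration only: the stub functions below stand in for the LLM generator and the semantic-consistency judge, and all names (`generate`, `check_consistency`, `reform`) are hypothetical, not the paper's implementation.

```python
# Toy sketch of a reflective autoformalization loop (illustrative only).

def generate(problem, feedback=None):
    # Stand-in for an LLM call that drafts, or revises, a formal statement.
    if feedback is None:
        # First draft deliberately mistranslates the problem.
        return "theorem toy : 1 + 1 = 3 := by sorry"
    # Revised draft after receiving the judge's critique.
    return "theorem toy : 1 + 1 = 2 := by norm_num"

def check_consistency(problem, statement):
    # Stand-in for semantic consistency evaluation; here a trivial string check.
    if "1 + 1 = 2" in statement:
        return True, "consistent"
    return False, "formal statement changes the claimed sum"

def reform(problem, max_rounds=3):
    # Iteratively generate, assess fidelity, and self-correct until
    # the judge accepts or the round budget runs out.
    feedback = None
    for _ in range(max_rounds):
        statement = generate(problem, feedback)
        ok, feedback = check_consistency(problem, statement)
        if ok:
            return statement
    return statement  # best effort after max_rounds

print(reform("Show that 1 + 1 = 2."))
```

In the actual method, a single model produces the whole reflective sequence, and PBSO scores different positions of that sequence with different rewards so that both the formalization and the critique are trained to be accurate.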