ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization
October 28, 2025
Authors: Guoxin Chen, Jing Wu, Xinjie Chen, Wayne Xin Zhao, Ruihua Song, Chengxi Li, Kai Fan, Dayiheng Liu, Minpeng Liao
cs.AI
Abstract
Autoformalization, which translates natural language mathematics into
machine-verifiable formal statements, is critical for using formal mathematical
reasoning to solve math problems stated in natural language. While Large
Language Models can generate syntactically correct formal statements, they
often fail to preserve the original problem's semantic intent. This limitation
arises because existing approaches treat autoformalization as a simple
translation task, lacking the self-reflection and iterative refinement
mechanisms that human experts naturally employ. To address these issues, we
propose ReForm, a Reflective Autoformalization method that tightly integrates
semantic consistency evaluation into the autoformalization process. This
enables the model to iteratively generate formal statements, assess their
semantic fidelity, and self-correct identified errors through progressive
refinement. To effectively train this reflective model, we introduce
Prospective Bounded Sequence Optimization (PBSO), which employs different
rewards at different sequence positions to ensure that the model develops both
accurate autoformalization and correct semantic validation, preventing
superficial critiques that would undermine the purpose of reflection. Extensive
experiments across four autoformalization benchmarks demonstrate that ReForm
achieves an average improvement of 17.2 percentage points over the strongest
baselines. To further ensure evaluation reliability, we introduce
ConsistencyCheck, a benchmark of 859 expert-annotated items that not only
validates LLMs as judges but also reveals that autoformalization is inherently
difficult: even human experts produce semantic errors in up to 38.5% of cases.
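The generate-assess-refine loop described above can be sketched in a few lines. This is a minimal illustration only, not the paper's implementation: `generate`, `assess`, and `refine` are hypothetical stand-ins for the model's formalization, semantic-consistency-evaluation, and self-correction steps, and the round budget is an assumed parameter.

```python
# Hypothetical sketch of ReForm's reflective loop: generate a formal
# statement, check its semantic fidelity, and refine until the check
# passes or the round budget is exhausted. All helpers are stand-ins.

def autoformalize(problem, generate, assess, refine, max_rounds=3):
    """Iteratively formalize `problem`; stop once no semantic errors remain."""
    statement = generate(problem)
    for _ in range(max_rounds):
        critique = assess(problem, statement)  # semantic consistency evaluation
        if critique is None:                   # no semantic errors found
            return statement
        statement = refine(problem, statement, critique)
    return statement

# Toy stand-ins so the sketch runs end to end.
def generate(p):
    return f"theorem t : {p}"

def assess(p, s):
    return None if p in s else "statement drifts from the problem"

def refine(p, s, critique):
    return f"theorem t : {p}"

print(autoformalize("2 + 2 = 4", generate, assess, refine))
```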