ReForm: 将来有界シーケンス最適化による反射的自己形式化

要旨

自然言語で記述された数学を機械検証可能な形式記述に変換する自動形式化（Autoformalization）は、自然言語で述べられた数学問題を形式的数学推論によって解決する上で極めて重要である。大規模言語モデルは構文的に正しい形式記述を生成できるが、元の問題の意味的意図を保持できない場合が多い。この限界は、LLMアプローチが自動形式化を単純な翻訳タスクとして扱い、人間の専門家が自然に用いる内省的検討や反復的改良のメカニズムを欠いていることに起因する。これらの課題に対処するため、我々は意味的一貫性評価を自動形式化プロセスに緊密に統合した反射的自動形式化手法「ReForm」を提案する。これによりモデルは形式的記述を反復生成し、その意味的忠実性を評価し、特定された誤りを段階的改良によって自己修正できる。この反射的モデルを効果的に訓練するため、異なる系列位置で異なる報酬を用いることで、モデルが正確な自動形式化と適切な意味的検証の両方を習得し、反省の目的を損なう表面的批判を防止する「将来境界付き系列最適化（PBSO）」を導入した。4つの自動形式化ベンチマークにおける大規模実験により、ReFormが最強のベースラインを平均17.2ポイント上回ることを実証した。評価の信頼性をさらに確保するため、専門家が注釈を付けた859項目から成るConsistencyCheckベンチマークを導入。これはLLMを評価者として検証するだけでなく、自動形式化が本質的に困難であること（人間の専門家でさえ最大38.5%の事例で意味的誤りを生じる）を明らかにした。

English

Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for using formal mathematical reasoning to solve math problems stated in natural language. While Large Language Models can generate syntactically correct formal statements, they often fail to preserve the original problem's semantic intent. This limitation arises from the LLM approaches' treating autoformalization as a simplistic translation task which lacks mechanisms for self-reflection and iterative refinement that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess its semantic fidelity, and self-correct identified errors through progressive refinement. To effectively train this reflective model, we introduce Prospective Bounded Sequence Optimization (PBSO), which employs different rewards at different sequence positions to ensure that the model develops both accurate autoformalization and correct semantic validations, preventing superficial critiques that would undermine the purpose of reflection. Extensive experiments across four autoformalization benchmarks demonstrate that ReForm achieves an average improvement of 17.2 percentage points over the strongest baselines. To further ensure evaluation reliability, we introduce ConsistencyCheck, a benchmark of 859 expert-annotated items that not only validates LLMs as judges but also reveals that autoformalization is inherently difficult: even human experts produce semantic errors in up to 38.5% of cases.

ReForm: 将来有界シーケンス最適化による反射的自己形式化

ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization

要旨

Support