ReForm: 전향적 유계 시퀀스 최적화를 통한 반성적 자동 형식화

초록

자연어 수학을 기계 검증 가능한 형식 명제로 변환하는 자동형식화는 자연어로 기술된 수학 문제를 형식적 수학 추론을 통해 해결하는 데 핵심적입니다. 대규모 언어 모델은 구문적으로 정확한 형식 명제를 생성할 수 있으나, 원본 문제의 의미론적 의도를 보존하지 못하는 경우가 많습니다. 이러한 한계는 LLM 접근법이 자동형식화를 단순한 번역 작업으로 취급하여 인간 전문가가 당연히 수행하는 자기 반성과 반복적 개선 메커니즘이 부재하기 때문에 발생합니다. 이러한 문제를 해결하기 위해 우리는 의미 일관성 평가를 자동형식화 과정에 긴밀하게 통합하는 반성적 자동형식화 방법인 ReForm을 제안합니다. 이를 통해 모델이 형식 명제를 반복적으로 생성하고, 생성된 명제의 의미론적 충실도를 평가하며, 식별된 오류를 점진적 개선을 통해 자가 수정할 수 있습니다. 이러한 반성적 모델을 효과적으로 훈련시키기 위해 우리는 Prospective Bounded Sequence Optimization(PBSO)을 도입했습니다. PBSO는 시퀀스의 서로 다른 위치에 서로 다른 보상을 적용하여 모델이 정확한 자동형식화와 올바른 의미론적 검증 능력을 모두 발전시키도록 하여, 반성의 목적을 훼손할 수 있는 피상적인 비판을 방지합니다. 4개의 자동형식화 벤치마크에 대한 포괄적인 실험 결과, ReForm이 가장 강력한 기준 모델 대비 평균 17.2% 포인트 향상된 성능을 달성했음을 보여줍니다. 평가 신뢰성을 더욱 확보하기 위해 우리는 859개의 전문가 주석 항목으로 구성된 ConsistencyCheck 벤치마크를 도입했습니다. 이 벤치마크는 LLM을 평가자로 검증할 뿐만 아니라, 자동형식화가 본질적으로 어려운 작업임을 드러내며, 심지어 인간 전문가조차 최대 38.5%의 경우에서 의미론적 오류를 생성함을 보여줍니다.

English

Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for using formal mathematical reasoning to solve math problems stated in natural language. While Large Language Models can generate syntactically correct formal statements, they often fail to preserve the original problem's semantic intent. This limitation arises from the LLM approaches' treating autoformalization as a simplistic translation task which lacks mechanisms for self-reflection and iterative refinement that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess its semantic fidelity, and self-correct identified errors through progressive refinement. To effectively train this reflective model, we introduce Prospective Bounded Sequence Optimization (PBSO), which employs different rewards at different sequence positions to ensure that the model develops both accurate autoformalization and correct semantic validations, preventing superficial critiques that would undermine the purpose of reflection. Extensive experiments across four autoformalization benchmarks demonstrate that ReForm achieves an average improvement of 17.2 percentage points over the strongest baselines. To further ensure evaluation reliability, we introduce ConsistencyCheck, a benchmark of 859 expert-annotated items that not only validates LLMs as judges but also reveals that autoformalization is inherently difficult: even human experts produce semantic errors in up to 38.5% of cases.

ReForm: 전향적 유계 시퀀스 최적화를 통한 반성적 자동 형식화

ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization

초록

Support