TrustGeoGen: Scalable and Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving

Abstract

Mathematical geometric problem solving (GPS) often requires effective integration of multimodal information and verifiable logical coherence. Despite the rapid development of large language models in general problem solving, GPS remains unresolved with respect to both methodology and benchmarks, especially because existing synthetic GPS benchmarks are often not self-verified and contain noise and self-contradictory information caused by LLM hallucinations. In this paper, we propose TrustGeoGen, a scalable data engine for problem generation that uses formal verification to provide a principled benchmark, which we believe lays the foundation for the further development of GPS methods. The engine synthesizes geometric data through four key innovations: 1) multimodal-aligned generation of diagrams, textual descriptions, and stepwise solutions; 2) formal verification ensuring rule-compliant reasoning paths; 3) a bootstrapping mechanism enabling complexity escalation via recursive state generation; and 4) our GeoExplore series of algorithms, which simultaneously produce multi-solution variants and self-reflective backtracking traces. Through formal logical verification, TrustGeoGen produces the GeoTrust-200K dataset with guaranteed modality integrity, along with the GeoTrust-test test set. Experiments show that state-of-the-art models achieve only 49.17% accuracy on GeoTrust-test, demonstrating its evaluation stringency. Crucially, models trained on GeoTrust achieve OOD generalization on GeoQA, significantly reducing logical inconsistencies relative to pseudo-labels annotated by OpenAI-o1. Our code is available at https://github.com/Alpha-Innovator/TrustGeoGen