TrustGeoGen: Scalable and Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving

Abstract

Mathematical geometric problem solving (GPS) often requires effective integration of multimodal information and verifiable logical coherence. Despite the rapid development of large language models (LLMs) in general problem solving, GPS remains unresolved in terms of both methodology and benchmarks, especially given that existing synthetic GPS benchmarks are often not self-verified and contain noise and self-contradictory information due to LLM hallucinations. In this paper, we propose TrustGeoGen, a scalable data engine for problem generation with formal verification, which provides a principled benchmark and which we believe lays the foundation for the further development of GPS methods. The engine synthesizes geometric data through four key innovations: 1) multimodal-aligned generation of diagrams, textual descriptions, and stepwise solutions; 2) formal verification ensuring rule-compliant reasoning paths; 3) a bootstrapping mechanism enabling complexity escalation via recursive state generation; and 4) our GeoExplore series of algorithms, which simultaneously produce multi-solution variants and self-reflective backtracking traces. Through formal logical verification, TrustGeoGen produces the GeoTrust-200K dataset with guaranteed modality integrity, along with the GeoTrust-test test set. Experiments reveal that state-of-the-art models achieve only 49.17% accuracy on GeoTrust-test, demonstrating its evaluation stringency. Crucially, models trained on GeoTrust achieve OOD generalization on GeoQA, significantly reducing logical inconsistencies relative to pseudo-labels annotated by OpenAI-o1. Our code is available at https://github.com/Alpha-Innovator/TrustGeoGen
