TrustGeoGen: Scalable and Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving
April 22, 2025
Authors: Daocheng Fu, Zijun Chen, Renqiu Xia, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Junchi Yan, Botian Shi, Bo Zhang, Yu Qiao
cs.AI
Abstract
Mathematical geometric problem solving (GPS) often requires effective
integration of multimodal information and verifiable logical coherence. Despite
the rapid development of large language models in general problem solving, GPS
remains unresolved in both methodology and benchmarks, especially given that
existing synthetic GPS benchmarks are often not self-verified and contain noise
and self-contradictory information due to LLM hallucinations. In this paper, we
propose a scalable data engine called TrustGeoGen for problem generation, with
formal verification to provide a principled benchmark, which we believe lays
the foundation for the further development of methods for GPS. The engine
synthesizes geometric data through four key innovations: 1) multimodal-aligned
generation of diagrams, textual descriptions, and stepwise solutions; 2) formal
verification ensuring rule-compliant reasoning paths; 3) a bootstrapping
mechanism enabling complexity escalation via recursive state generation; and
4) our devised GeoExplore series of algorithms, which simultaneously produce
multi-solution variants and self-reflective backtracking traces. Through formal
logical verification, TrustGeoGen produces the GeoTrust-200K dataset with
guaranteed modality integrity, along with the GeoTrust-test test set.
Experiments reveal that state-of-the-art models achieve only 49.17% accuracy on
GeoTrust-test, demonstrating its evaluation stringency. Crucially, models
trained on GeoTrust achieve out-of-distribution (OOD) generalization on GeoQA,
significantly reducing logical inconsistencies relative to pseudo-labels
annotated by OpenAI-o1. Our code is available at
https://github.com/Alpha-Innovator/TrustGeoGen
AI-Generated Summary