GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization

Abstract

Recent advances in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, particularly in mathematical reasoning. Within that area, geometry problem solving remains challenging, and auxiliary construction plays an essential role in it. Existing approaches either achieve suboptimal performance or rely on very large LLMs (e.g., GPT-4o), incurring high computational costs. We posit that reinforcement learning with verifiable reward (e.g., GRPO) offers a promising direction for training smaller models that effectively combine auxiliary construction with robust geometric reasoning. However, directly applying GRPO to geometric reasoning has a fundamental limitation: it depends on unconditional rewards, which leads to indiscriminate and counterproductive auxiliary constructions. To address these challenges, we propose Group Contrastive Policy Optimization (GCPO), a novel reinforcement learning framework featuring two key innovations: (1) Group Contrastive Masking, which adaptively provides positive or negative reward signals for auxiliary construction based on contextual utility, and (2) a length reward that promotes longer reasoning chains. Building on GCPO, we develop GeometryZero, a family of affordably sized geometric reasoning models that judiciously determine when to employ auxiliary construction. Our extensive empirical evaluation across popular geometric benchmarks (Geometry3K, MathVista) demonstrates that GeometryZero models consistently outperform baselines (e.g., GRPO), achieving an average improvement of 4.29% across all benchmarks.
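The abstract does not give GCPO's formulas, so the following is only a rough sketch of the idea it describes. `group_relative_advantages` follows the standard GRPO recipe of normalizing each reward against its sampling group. `contrastive_mask`, `shaped_rewards`, the `margin` and `aux_weight` parameters, and the rule of comparing accuracy with and without auxiliary construction are all assumptions made for illustration, not the authors' definitions. The length reward is left out.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward by its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def contrastive_mask(correct, used_aux, margin=0.0):
    """Hypothetical mask in {+1, 0, -1} for one problem's group of responses.

    +1 if responses that used auxiliary construction were more accurate,
    -1 if they were less accurate, 0 if there is no difference or no contrast.
    """
    with_aux = [float(c) for c, a in zip(correct, used_aux) if a]
    without_aux = [float(c) for c, a in zip(correct, used_aux) if not a]
    if not with_aux or not without_aux:
        return 0  # nothing to contrast against in this group
    gap = mean(with_aux) - mean(without_aux)
    if gap > margin:
        return 1
    if gap < -margin:
        return -1
    return 0


def shaped_rewards(correct, used_aux, aux_weight=0.5):
    """Accuracy reward plus a masked (conditional) auxiliary-construction term."""
    mask = contrastive_mask(correct, used_aux)
    return [float(c) + aux_weight * mask * float(a)
            for c, a in zip(correct, used_aux)]


# Group of four sampled responses to one problem (made-up values).
correct = [1, 1, 0, 0]
used_aux = [True, True, False, False]
rewards = shaped_rewards(correct, used_aux)
advantages = group_relative_advantages(rewards)
```

In this sketch the auxiliary term is rewarded only when construction helped within the group and penalized when it hurt. An unconditional reward would instead add the same bonus to every response that used construction.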
