GeometryZero: Improving Geometry Problem Solving in Large Language Models with Group Contrastive Policy Optimization

Abstract

Recent advances in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, particularly in mathematical reasoning. Within this area, geometry problem solving remains challenging, and auxiliary construction plays an essential role in it. Existing approaches either achieve suboptimal performance or rely on very large LLMs (e.g., GPT-4o), incurring substantial computational costs. We posit that reinforcement learning with verifiable rewards (e.g., GRPO) offers a promising direction for training smaller models that effectively combine auxiliary construction with robust geometric reasoning. However, directly applying GRPO to geometric reasoning has a fundamental limitation: it depends on unconditional rewards, which leads to indiscriminate and counterproductive auxiliary constructions. To address these challenges, we propose Group Contrastive Policy Optimization (GCPO), a novel reinforcement learning framework featuring two key innovations: (1) Group Contrastive Masking, which adaptively provides positive or negative reward signals for auxiliary construction based on contextual utility, and (2) a length reward that promotes longer reasoning chains. Building on GCPO, we develop GeometryZero, a family of affordably sized geometric reasoning models that judiciously determine when to employ auxiliary construction. Our extensive empirical evaluation across popular geometric benchmarks (Geometry3K, MathVista) demonstrates that GeometryZero models consistently outperform baselines (e.g., GRPO), achieving an average improvement of 4.29% across all benchmarks.
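
To make the contrast between an unconditional and a conditional auxiliary-construction reward concrete, the following is a minimal sketch and not the implementation behind GeometryZero. The group-normalized advantage follows the standard GRPO formulation. Everything else is an assumption made for illustration: the function names, the `margin` and `aux_bonus` parameters, and the rule that the mask sign comes from comparing the accuracy of rollouts with and without auxiliary construction. The length reward is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward by the group mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def contrastive_mask(acc_with_aux, acc_without_aux, margin=0.0):
    """Hypothetical mask: +1 if auxiliary construction helps on this problem,
    -1 if it hurts, 0 if the two groups are within `margin` of each other."""
    if acc_with_aux > acc_without_aux + margin:
        return 1
    if acc_with_aux < acc_without_aux - margin:
        return -1
    return 0


def shaped_rewards(correct, used_aux, mask, aux_bonus=0.5):
    """Hypothetical reward: correctness plus a signed bonus applied only to
    rollouts that used auxiliary construction. `aux_bonus` is illustrative."""
    return [float(c) + mask * aux_bonus * float(u)
            for c, u in zip(correct, used_aux)]


if __name__ == "__main__":
    # Four rollouts for one problem: which were correct, which used a construction.
    correct = [1, 0, 1, 0]
    used_aux = [1, 1, 0, 0]
    mask = contrastive_mask(acc_with_aux=0.25, acc_without_aux=0.75)
    rewards = shaped_rewards(correct, used_aux, mask)
    print(mask, rewards, group_relative_advantages(rewards))
```

With an unconditional reward the bonus would always be positive, so every rollout would be pushed toward adding constructions. In this sketch the sign flips on problems where constructions lower accuracy, which is the behavior the abstract attributes to Group Contrastive Masking.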