GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization

Abstract

Recent advances in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, particularly in mathematical reasoning. Geometry problem solving nevertheless remains a challenging area, one in which auxiliary construction plays an essential role. Existing approaches either achieve suboptimal performance or rely on very large LLMs (e.g., GPT-4o), incurring substantial computational costs. We posit that reinforcement learning with verifiable reward (e.g., GRPO) offers a promising direction for training smaller models that effectively combine auxiliary construction with robust geometric reasoning. However, directly applying GRPO to geometric reasoning has a fundamental limitation: because it depends on unconditional rewards, it leads to indiscriminate and counterproductive auxiliary constructions. To address these challenges, we propose Group Contrastive Policy Optimization (GCPO), a novel reinforcement learning framework featuring two key innovations: (1) Group Contrastive Masking, which adaptively provides positive or negative reward signals for auxiliary construction based on contextual utility, and (2) a length reward that promotes longer reasoning chains. Building on GCPO, we develop GeometryZero, a family of affordable-size geometric reasoning models that judiciously determine when to employ auxiliary construction. Our extensive empirical evaluation across popular geometric benchmarks (Geometry3K, MathVista) demonstrates that GeometryZero models consistently outperform baselines (e.g., GRPO), achieving an average improvement of 4.29% across all benchmarks.
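
To make the idea of a conditional reward concrete, the following is a minimal sketch of one plausible reading of Group Contrastive Masking combined with GRPO-style group-relative advantages. It is not the authors' implementation. The function names, the `margin` threshold, the `aux_weight` coefficient, and the rule of comparing mean accuracy between responses that do and do not use auxiliary construction are all assumptions made for illustration; the abstract does not specify them. The length reward is left out.

```python
from statistics import mean, pstdev


def contrastive_mask(correct, uses_aux, margin=0.0):
    """Return +1, -1, or 0 for one group of sampled responses.

    Hypothetical rule (an assumption, not taken from the paper):
    +1 if responses with auxiliary construction were more accurate,
    -1 if they were less accurate,
     0 if there is no clear gap or one side of the comparison is empty.
    """
    with_aux = [c for c, a in zip(correct, uses_aux) if a]
    without_aux = [c for c, a in zip(correct, uses_aux) if not a]
    if not with_aux or not without_aux:
        return 0
    gap = sum(with_aux) / len(with_aux) - sum(without_aux) / len(without_aux)
    if gap > margin:
        return 1
    if gap < -margin:
        return -1
    return 0


def group_rewards(correct, uses_aux, aux_weight=0.5):
    """Accuracy reward plus a signed bonus for auxiliary construction.

    aux_weight is an illustrative coefficient, not a value from the paper.
    """
    mask = contrastive_mask(correct, uses_aux)
    return [float(c) + aux_weight * mask * float(a)
            for c, a in zip(correct, uses_aux)]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one problem; 1 = correct / uses auxiliary construction.
    correct = [0, 0, 1, 1]
    uses_aux = [1, 1, 0, 0]
    rewards = group_rewards(correct, uses_aux)
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this toy group the responses that drew auxiliary constructions were the wrong ones, so the mask turns negative and those responses are penalized. An unconditional bonus would instead have rewarded the construction regardless of whether it helped, which is the failure mode the abstract attributes to applying GRPO directly.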
