GeometryZero: Improving Geometry Problem Solving in LLMs with Group Contrastive Policy Optimization

Abstract

Recent advances in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, particularly in mathematical reasoning. Within this area, geometry problem solving remains challenging, and auxiliary construction plays an essential role in it. Existing approaches either achieve suboptimal performance or rely on very large LLMs (e.g., GPT-4o), incurring substantial computational costs. We posit that reinforcement learning with verifiable rewards (e.g., GRPO) offers a promising direction for training smaller models that effectively combine auxiliary construction with robust geometric reasoning. However, directly applying GRPO to geometric reasoning has a fundamental limitation: it depends on unconditional rewards, which leads to indiscriminate and counterproductive auxiliary constructions. To address these challenges, we propose Group Contrastive Policy Optimization (GCPO), a novel reinforcement learning framework featuring two key innovations: (1) Group Contrastive Masking, which adaptively provides positive or negative reward signals for auxiliary construction based on its contextual utility, and (2) a length reward that promotes longer reasoning chains. Building on GCPO, we develop GeometryZero, a family of affordably sized geometric reasoning models that judiciously determine when to employ auxiliary construction. Our extensive empirical evaluation on popular geometry benchmarks (Geometry3K, MathVista) demonstrates that GeometryZero models consistently outperform baselines (e.g., GRPO), achieving an average improvement of 4.29% across all benchmarks.
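The two reward components described above can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract gives no formulas, so the function names (`contrastive_mask`, `shaped_reward`), the comparison of mean accuracy between rollouts with and without auxiliary construction, and the weights `aux_weight`, `length_weight`, and `max_tokens` are all assumptions chosen only to show how a conditional (rather than unconditional) auxiliary reward might behave.

```python
from statistics import mean


def contrastive_mask(acc_with_aux, acc_without_aux, margin=0.0):
    """Hypothetical mask: compare two groups of rollouts for one problem.

    Returns +1 if rollouts using auxiliary construction were more accurate,
    -1 if they were less accurate, and 0 if the gap is within `margin`.
    """
    gap = mean(acc_with_aux) - mean(acc_without_aux)
    if gap > margin:
        return 1
    if gap < -margin:
        return -1
    return 0


def shaped_reward(correct, used_aux, num_tokens, mask,
                  aux_weight=0.5, length_weight=0.1, max_tokens=1024):
    """Hypothetical per-rollout reward combining three terms.

    All weights are illustrative assumptions, not values from the paper.
    """
    # Verifiable accuracy reward: 1 for a correct final answer, else 0.
    reward = 1.0 if correct else 0.0
    # Conditional auxiliary reward: the sign comes from the group-level mask,
    # so auxiliary construction is rewarded only where it helped.
    if used_aux:
        reward += aux_weight * mask
    # Length reward: grows with reasoning length, capped at max_tokens.
    reward += length_weight * min(num_tokens, max_tokens) / max_tokens
    return reward


# Example: auxiliary construction helped on this problem, so the mask is +1.
mask = contrastive_mask([1, 1, 0], [0, 0, 1])
bonus_case = shaped_reward(correct=True, used_aux=True, num_tokens=512, mask=mask)
```

Under these assumptions, the same rollout would be penalized rather than rewarded for using auxiliary construction on a problem where the mask is -1, which is the contrast with an unconditional reward that the abstract draws.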