LongCat-Flash-Prover:通过智能体工具集成强化学习推进原生形式化推理
LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning
March 22, 2026
作者: Jianing Wang, Jianfei Zhang, Qi Guo, Linsen Guo, Rumei Li, Chao Zhang, Chong Peng, Cunguang Wang, Dengchang Zhao, Jiarong Shi, Jingang Wang, Liulin Feng, Mengxia Shen, Qi Li, Shengnan An, Shun Wang, Wei Shi, Xiangyu Xi, Xiaoyu Li, Xuezhi Cao, Yi Lu, Yunke Zhao, Zhengyu Chen, Zhimin Lin, Wei Wang, Peng Pei, Xunliang Cai
cs.AI
摘要
我们推出LongCat-Flash-Prover——一款拥有5600亿参数的旗舰级开源专家混合模型,通过智能体工具集成推理技术,在Lean4中推动原生形式化推理的发展。我们将原生形式化推理任务分解为三个独立的形式化能力:自动形式化、证明草图构建与定理证明。为强化这些能力,提出混合专家迭代框架以扩展高质量任务轨迹,包括基于非形式化问题生成形式化命题、直接从命题生成完整证明或引理式证明草图。在智能体强化学习阶段,我们提出分层重要性采样策略优化算法,旨在稳定专家混合模型在长周期任务上的训练。该算法采用梯度掩码策略,同时考虑策略滞后性以及序列与词元层面固有的训练-推理引擎差异。此外,还引入定理一致性与合法性检测机制以消除奖励破解问题。大量评估表明,LongCat-Flash-Prover在自动形式化与定理证明任务上为开源权重模型设立了新标杆:仅需每个问题72次推理预算即在MiniF2F测试集达到97.1%通过率;在更具挑战性的基准测试中,以每个问题不超过220次尝试的成绩解决70.8%的ProverBench与41.5%的PutnamBench问题,显著超越现有开源权重基线。
English
We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of- Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. Additionally, we also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues. Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test using only 72 inference budget per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.