ChatPaper.aiChatPaper

LongCat-Flash-Prover:透過代理工具整合式強化學習推進原生形式化推理

LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

March 22, 2026
作者: Jianing Wang, Jianfei Zhang, Qi Guo, Linsen Guo, Rumei Li, Chao Zhang, Chong Peng, Cunguang Wang, Dengchang Zhao, Jiarong Shi, Jingang Wang, Liulin Feng, Mengxia Shen, Qi Li, Shengnan An, Shun Wang, Wei Shi, Xiangyu Xi, Xiaoyu Li, Xuezhi Cao, Yi Lu, Yunke Zhao, Zhengyu Chen, Zhimin Lin, Wei Wang, Peng Pei, Xunliang Cai
cs.AI

摘要

我們推出旗艦級開源模型LongCat-Flash-Prover——一個擁有5600億參數的混合專家模型,通過工具集成推理代理技術在Lean4中推進原生形式推理能力。我們將原生形式推理任務解構為三個獨立的形式化能力:自動形式化、證明草圖構建與定理證明。為強化這些能力,我們提出混合專家迭代框架以擴展高質量任務軌跡,包括根據非形式化問題生成形式化命題、直接從命題生成完整證明或引理式證明草圖。在代理強化學習階段,我們提出分層重要性採樣策略優化算法,旨在穩定混合專家模型在長週期任務上的訓練。該算法採用梯度掩碼策略,同時考慮策略陳舊性以及序列與詞元層面的訓練-推理引擎固有差異。此外,我們還引入定理一致性與合法性檢測機制以消除獎勵欺詐問題。大量實驗表明,LongCat-Flash-Prover在自動形式化與定理證明任務上為開源權重模型設立了新標杆:憑藉卓越的樣本效率,在MiniF2F測試集上僅用每題72次推理預算即達成97.1%通過率;在更具挑戰性的基準測試中,以每題不超過220次嘗試解決了ProverBench的70.8%和PutnamBench的41.5%問題,顯著超越現有開源權重基準模型。
English
We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of- Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. Additionally, we also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues. Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test using only 72 inference budget per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.
PDF632March 25, 2026