LongCat-Flash-Prover：透過代理工具整合式強化學習推進原生形式化推理

摘要

我們推出旗艦級開源模型LongCat-Flash-Prover——一個擁有5600億參數的混合專家模型，通過工具集成推理代理技術在Lean4中推進原生形式推理能力。我們將原生形式推理任務解構為三個獨立的形式化能力：自動形式化、證明草圖構建與定理證明。為強化這些能力，我們提出混合專家迭代框架以擴展高質量任務軌跡，包括根據非形式化問題生成形式化命題、直接從命題生成完整證明或引理式證明草圖。在代理強化學習階段，我們提出分層重要性採樣策略優化算法，旨在穩定混合專家模型在長週期任務上的訓練。該算法採用梯度掩碼策略，同時考慮策略陳舊性以及序列與詞元層面的訓練-推理引擎固有差異。此外，我們還引入定理一致性與合法性檢測機制以消除獎勵欺詐問題。大量實驗表明，LongCat-Flash-Prover在自動形式化與定理證明任務上為開源權重模型設立了新標杆：憑藉卓越的樣本效率，在MiniF2F測試集上僅用每題72次推理預算即達成97.1%通過率；在更具挑戰性的基準測試中，以每題不超過220次嘗試解決了ProverBench的70.8%和PutnamBench的41.5%問題，顯著超越現有開源權重基準模型。

English

We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of- Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. Additionally, we also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues. Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test using only 72 inference budget per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.

LongCat-Flash-Prover：透過代理工具整合式強化學習推進原生形式化推理

LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

摘要

Support