LongCat-Flash-Prover：通过智能体工具集成强化学习推进原生形式化推理

摘要

我们推出LongCat-Flash-Prover——一款拥有5600亿参数的旗舰级开源专家混合模型，通过智能体工具集成推理技术，在Lean4中推动原生形式化推理的发展。我们将原生形式化推理任务分解为三个独立的形式化能力：自动形式化、证明草图构建与定理证明。为强化这些能力，提出混合专家迭代框架以扩展高质量任务轨迹，包括基于非形式化问题生成形式化命题、直接从命题生成完整证明或引理式证明草图。在智能体强化学习阶段，我们提出分层重要性采样策略优化算法，旨在稳定专家混合模型在长周期任务上的训练。该算法采用梯度掩码策略，同时考虑策略滞后性以及序列与词元层面固有的训练-推理引擎差异。此外，还引入定理一致性与合法性检测机制以消除奖励破解问题。大量评估表明，LongCat-Flash-Prover在自动形式化与定理证明任务上为开源权重模型设立了新标杆：仅需每个问题72次推理预算即在MiniF2F测试集达到97.1%通过率；在更具挑战性的基准测试中，以每个问题不超过220次尝试的成绩解决70.8%的ProverBench与41.5%的PutnamBench问题，显著超越现有开源权重基线。

English

We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of- Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. Additionally, we also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues. Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test using only 72 inference budget per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.