BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

Abstract

The rapid progress of frontier large language models has led to widespread benchmark saturation, limiting the ability of existing datasets to differentiate model capabilities or provide useful training signal. For instance, on LiveCodeBench, frontier models achieve over 99% Pass@1 on easy splits and exceed 90% Pass@1 on average across difficulty levels. Constructing new, challenging datasets typically requires substantial human effort, creating a bottleneck for progress. We introduce BenchEvolver, a solution-centric evolutionary framework that automatically transforms existing coding problems into harder variants. Rather than generating problems from scratch, BenchEvolver evolves reference solutions through structured transformations and derives corresponding statements and tests from the evolved solutions. This design grounds generation in executable semantics, enabling scalable construction of high-quality, diverse, and difficult tasks with verifiable correctness. Applying BenchEvolver to LiveCodeBench and SciCode, we obtain evolved tasks that are substantially harder while maintaining validity, reference correctness, and diversity. We further curate LiveCodeBench-Plus, a 91-problem benchmark combining evolved and difficult original LCB-v6 tasks, where frontier-model Pass@1 ranges from 27.5% to 62.6%, restoring clear discrimination among strong coding models. Importantly, evolved tasks remain challenging even for the model that generates them, enabling self-improvement. We further show that RL on evolved LCB tasks improves held-out coding performance: for gpt-oss-20b, seed+evolved training achieves +8.7 and +8.3 Pass@1 gains on LCB v6 Hard and LCB-Pro Easy, exceeding seed-only gains by 70.7% and 34.8%, respectively. Our results show that BenchEvolver can convert saturated benchmarks into frontier-level evaluation suites and reusable training signal.
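
The solution-centric idea described above (evolve the reference solution first, then derive tests by executing it) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the transformations, prompts, or validators, so the seed task, the "evolved" task, and the helper names `derive_tests` and `passes` are all hypothetical and chosen only for illustration.

```python
import random
from typing import Callable, List, Tuple

Test = Tuple[List[int], int]


def seed_solution(nums: List[int]) -> int:
    """Hypothetical seed task: return the maximum element of a list."""
    return max(nums)


def evolved_solution(nums: List[int]) -> int:
    """Hypothetical evolved task: maximum sum of a non-empty contiguous subarray.

    Stands in for the output of a structured transformation of the seed
    solution; the real transformations are not given in the abstract.
    """
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)   # extend the current run or restart at x
        best = max(best, cur)
    return best


def derive_tests(reference: Callable[[List[int]], int],
                 n_tests: int = 20, seed: int = 0) -> List[Test]:
    """Derive (input, expected output) pairs by executing the reference."""
    rng = random.Random(seed)
    tests: List[Test] = []
    for _ in range(n_tests):
        nums = [rng.randint(-10, 10) for _ in range(rng.randint(1, 8))]
        tests.append((nums, reference(list(nums))))
    return tests


def passes(candidate: Callable[[List[int]], int], tests: List[Test]) -> bool:
    """Check a candidate solution against the derived tests."""
    return all(candidate(list(inp)) == expected for inp, expected in tests)


if __name__ == "__main__":
    tests = derive_tests(evolved_solution)
    print(len(tests), passes(evolved_solution, tests))
```

Because the expected outputs come from running the evolved reference, the tests are correct with respect to that reference by construction. A problem statement would then be written to match the evolved solution's behavior, a step this sketch omits.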