BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

Abstract

The rapid progress of frontier large language models has led to widespread benchmark saturation, limiting the ability of existing datasets to differentiate model capabilities or provide useful training signal. For instance, on LiveCodeBench, frontier models achieve over 99% Pass@1 on easy splits and exceed 90% Pass@1 on average across difficulty levels. Constructing new, challenging datasets typically requires substantial human effort, creating a bottleneck for progress. We introduce BenchEvolver, a solution-centric evolutionary framework that automatically transforms existing coding problems into harder variants. Rather than generating problems from scratch, BenchEvolver evolves reference solutions through structured transformations and derives corresponding statements and tests from the evolved solutions. This design grounds generation in executable semantics, enabling scalable construction of high-quality, diverse, and difficult tasks with verifiable correctness. Applying BenchEvolver to LiveCodeBench and SciCode, we obtain evolved tasks that are substantially harder while maintaining validity, reference correctness, and diversity. We further curate LiveCodeBench-Plus, a 91-problem benchmark combining evolved and difficult original LCB-v6 tasks, where frontier-model Pass@1 ranges from 27.5% to 62.6%, restoring clear discrimination among strong coding models. Importantly, evolved tasks remain challenging even for the model that generates them, enabling self-improvement. We further show that RL on evolved LCB tasks improves held-out coding performance: for gpt-oss-20b, seed+evolved training achieves +8.7 and +8.3 Pass@1 gains on LCB v6 Hard and LCB-Pro Easy, exceeding seed-only gains by 70.7% and 34.8%, respectively. Our results show that BenchEvolver can convert saturated benchmarks into frontier-level evaluation suites and reusable training signal.
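
As a rough illustration of the solution-centric idea described above, the following minimal sketch derives test cases by executing an evolved reference solution, then checks candidates against them. It is not the authors' implementation: the toy seed and evolved tasks, and the names `seed_solution`, `evolved_solution`, `derive_tests`, `passes`, and `pass_at_1`, are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Solution = Callable[[List[int]], int]
TestCase = Tuple[List[int], int]


def seed_solution(xs: List[int]) -> int:
    """Hypothetical seed task: return the sum of all elements."""
    return sum(xs)


def evolved_solution(xs: List[int]) -> int:
    """Hypothetical evolved task: maximum sum of a contiguous
    (possibly empty) subarray, computed with Kadane's algorithm."""
    best = cur = 0
    for x in xs:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best


def derive_tests(reference: Solution, inputs: Sequence[List[int]]) -> List[TestCase]:
    """Derive expected outputs by running the reference solution,
    so every test is grounded in executable semantics."""
    return [(list(inp), reference(list(inp))) for inp in inputs]


def passes(candidate: Solution, tests: Sequence[TestCase]) -> bool:
    """A candidate passes only if it matches the reference on every test."""
    return all(candidate(list(inp)) == expected for inp, expected in tests)


def pass_at_1(results: Sequence[bool]) -> float:
    """Fraction of tasks solved when one sample is drawn per task."""
    return sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    tests = derive_tests(evolved_solution, [[1, -2, 3, 4], [-1, -1], []])
    outcomes = [passes(evolved_solution, tests), passes(seed_solution, tests)]
    print(tests, outcomes, pass_at_1(outcomes))
```

In this toy setting the seed solution no longer passes the tests derived from the evolved reference, which mirrors, in a very simplified way, how evolving the solution first yields a harder task whose correctness can still be checked by execution.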