BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

Abstract

Rapid progress in frontier large language models has led to widespread benchmark saturation, limiting the ability of existing datasets to differentiate model capabilities or to provide useful training signal. For instance, on LiveCodeBench, frontier models achieve over 99% Pass@1 on easy splits and exceed 90% Pass@1 on average across difficulty levels. Constructing new, challenging datasets typically requires substantial human effort, which creates a bottleneck for progress. We introduce BenchEvolver, a solution-centric evolutionary framework that automatically transforms existing coding problems into harder variants. Rather than generating problems from scratch, BenchEvolver evolves reference solutions through structured transformations and derives the corresponding problem statements and tests from the evolved solutions. This design grounds generation in executable semantics, enabling scalable construction of high-quality, diverse, and difficult tasks with verifiable correctness. Applying BenchEvolver to LiveCodeBench and SciCode, we obtain evolved tasks that are substantially harder while maintaining validity, reference correctness, and diversity. We also curate LiveCodeBench-Plus, a 91-problem benchmark combining evolved tasks with difficult original LCB-v6 tasks, on which frontier-model Pass@1 ranges from 27.5% to 62.6%, restoring clear discrimination among strong coding models. Importantly, evolved tasks remain challenging even for the model that generates them, which enables self-improvement. Finally, we show that reinforcement learning (RL) on evolved LCB tasks improves held-out coding performance: for gpt-oss-20b, seed+evolved training achieves Pass@1 gains of +8.7 on LCB v6 Hard and +8.3 on LCB-Pro Easy, which are 70.7% and 34.8% larger, respectively, than the gains from seed-only training. Our results show that BenchEvolver can convert saturated benchmarks into frontier-level evaluation suites and reusable training signal.
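
The solution-centric idea can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the transformations, the test generator, or any function names, so everything below (`seed_solution`, `evolved_solution`, `derive_tests`, `brute_force`, `pass_rate`, and the hand-written "allow one deletion" transformation) is a hypothetical stand-in. The sketch only shows the ordering described above: the reference solution is changed first, and the tests are then obtained by executing the evolved reference, so every expected output is grounded in runnable code.

```python
import random


def seed_solution(nums):
    """Seed task: maximum sum of a non-empty contiguous subarray."""
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best


def evolved_solution(nums):
    """Hypothetical evolved task: same as the seed, but at most one element
    may be deleted from the chosen subarray (which must stay non-empty)."""
    keep = nums[0]          # best sum ending here with no deletion
    drop = float("-inf")    # best sum ending here with exactly one deletion
    best = nums[0]
    for x in nums[1:]:
        # Either extend a subarray that already used its deletion,
        # or delete x itself and keep the previous no-deletion sum.
        drop = max(drop + x, keep)
        keep = max(keep + x, x)
        best = max(best, keep, drop)
    return best


def brute_force(nums):
    """Slow independent oracle used to cross-check the evolved reference."""
    best = float("-inf")
    for i in range(len(nums)):
        for j in range(i, len(nums)):
            window = nums[i:j + 1]
            total = sum(window)
            best = max(best, total)
            if len(window) > 1:
                # Deleting the smallest element is the best single deletion.
                best = max(best, total - min(window))
    return best


def derive_tests(reference, n_cases=50, seed=0):
    """Derive (input, expected_output) pairs by executing the reference."""
    rng = random.Random(seed)
    tests = []
    for _ in range(n_cases):
        nums = [rng.randint(-9, 9) for _ in range(rng.randint(1, 8))]
        tests.append((nums, reference(nums)))
    return tests


def pass_rate(candidate, tests):
    """Fraction of derived tests that a candidate solution passes."""
    return sum(candidate(list(inp)) == out for inp, out in tests) / len(tests)


evolved_tests = derive_tests(evolved_solution)
# Reference-correctness check: the evolved reference agrees with the oracle.
assert all(brute_force(inp) == out for inp, out in evolved_tests)
print("evolved reference:", pass_rate(evolved_solution, evolved_tests))
print("seed solution on evolved tests:", pass_rate(seed_solution, evolved_tests))
```

In this sketch the seed solution no longer passes every derived test (for example, on `[1, -2, 0, 3]` it returns 3 while the evolved reference returns 4), which is a small-scale analogue of an evolved task being harder than its seed while keeping a verifiable reference.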