FrontierSmith: Generating Open-Ended Coding Problems at Scale

Abstract

Many real-world coding challenges are open-ended and admit no known optimal solution. Yet recent progress in LLM coding has focused on well-defined tasks such as feature implementation, bug fixing, and competitive programming. Open-ended coding remains a weak spot for LLMs, largely because open-ended training problems are scarce and expensive to construct. Our goal is to synthesize open-ended coding problems at scale to train stronger LLM coders. We introduce FrontierSmith, an automated system that iteratively evolves open-ended problems from existing closed-ended coding tasks. Starting from competitive programming problems, FrontierSmith generates candidate open-ended variants by changing the problems' goals, restricting outputs, and generalizing inputs. It then uses a quantitative idea divergence metric to select problems that elicit genuinely diverse approaches from different solvers. Agents then generate test cases and verifiers for the surviving candidates. On two open-ended coding benchmarks, training on our synthesized data yields substantial gains over the base models: Qwen3.5-9B improves by +8.82 points on FrontierCS and by +306.36 (Elo-rating-based performance) on ALE-bench; Qwen3.5-27B improves by +12.12 and +309.12, respectively. Like human-curated problems, the synthesized problems also lead agents to take more turns and use more tokens, suggesting that closed-ended seeds can be a practical starting point for long-horizon coding data.
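
The selection step can be illustrated with a minimal sketch. The abstract does not define the idea divergence metric, so everything below is an assumption made for illustration: each solver's attempt is summarized as a set of approach tags, divergence is taken to be the mean pairwise Jaccard distance between those sets, and the names `idea_divergence`, `select_candidates`, and the `threshold` parameter are hypothetical, not part of FrontierSmith.

```python
from itertools import combinations


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Return 1 - |intersection| / |union|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def idea_divergence(solver_ideas: list[set[str]]) -> float:
    """Mean pairwise Jaccard distance across solvers' idea sets.

    ASSUMPTION: this is a stand-in metric, not the one used in the paper.
    """
    pairs = list(combinations(solver_ideas, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def select_candidates(
    candidates: dict[str, list[set[str]]], threshold: float = 0.5
) -> list[str]:
    """Keep candidate problems whose solvers diverge at least `threshold`.

    `threshold` is a hypothetical cutoff chosen only for this example.
    """
    return [
        name
        for name, ideas in candidates.items()
        if idea_divergence(ideas) >= threshold
    ]


if __name__ == "__main__":
    # Hypothetical approach tags from three solvers per candidate problem.
    candidates = {
        "closed_like": [{"dp"}, {"dp"}, {"dp"}],
        "open_like": [{"greedy"}, {"annealing"}, {"beam_search"}],
    }
    print(select_candidates(candidates))
```

Under these assumptions, a candidate on which every solver converges to the same idea scores 0 and is filtered out, while one that draws disjoint strategies scores 1 and survives. A real pipeline would need some way to extract the idea sets from solver outputs, which this sketch leaves open.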