FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

Abstract

Many real-world coding challenges are open-ended and admit no known optimal solution. Yet recent progress in LLM coding has focused on well-defined tasks such as feature implementation, bug fixing, and competitive programming. Open-ended coding remains a weak spot for LLMs, largely because open-ended training problems are scarce and expensive to construct. Our goal is to synthesize open-ended coding problems at scale to train stronger LLM coders. We introduce FrontierSmith, an automated system that iteratively evolves open-ended problems from existing closed-ended coding tasks. Starting from competitive programming problems, FrontierSmith generates candidate open-ended variants by changing the problems' goals, restricting outputs, and generalizing inputs. It then uses a quantitative idea divergence metric to select problems that elicit genuinely diverse approaches from different solvers. Agents then generate test cases and verifiers for the surviving candidates. On two open-ended coding benchmarks, training on our synthesized data yields substantial gains over the base models: Qwen3.5-9B improves its FrontierCS score by +8.82 and its ALE-bench Elo-rating-based performance by +306.36; Qwen3.5-27B improves by +12.12 and +309.12, respectively. The synthesized problems also lead agents to take more turns and use more tokens, as human-curated problems do, suggesting that closed-ended seeds can be a practical starting point for long-horizon coding data.
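
The abstract does not define the idea divergence metric, so the following is only an illustrative sketch of the selection step, not the authors' method. It assumes, hypothetically, that each solver's approach to a candidate problem has been summarized as a set of technique tags, and it scores a candidate by the mean pairwise Jaccard distance between those sets. The function names, the tag representation, and the 0.5 threshold are all assumptions introduced for illustration.

```python
from itertools import combinations


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def idea_divergence(approaches: list[set[str]]) -> float:
    """Hypothetical stand-in metric: mean pairwise Jaccard distance
    between the technique tags of different solvers."""
    pairs = list(combinations(approaches, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def select_open_ended(
    candidates: dict[str, list[set[str]]], threshold: float = 0.5
) -> list[str]:
    """Keep candidates whose solvers disagree enough (threshold is assumed)."""
    return [
        name
        for name, approaches in candidates.items()
        if idea_divergence(approaches) >= threshold
    ]


# Toy data: every solver uses the same idea on the first candidate,
# while the second candidate draws out different ideas.
candidates = {
    "closed_variant": [{"dp"}, {"dp"}, {"dp"}],
    "open_variant": [
        {"greedy"},
        {"simulated_annealing"},
        {"beam_search", "greedy"},
    ],
}
selected = select_open_ended(candidates)
print(selected)
```

In this toy setup, a candidate on which all solvers converge to one technique scores zero and is filtered out, whereas a candidate that draws out distinct techniques passes the threshold. A real system would need some way of extracting and comparing the ideas behind solutions, which this sketch leaves out.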