
FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

May 14, 2026
Authors: Runyuan He, Qiuyang Mang, Shang Zhou, Kaiyuan Liu, Hanchen Li, Huanzhi Mao, Qizheng Zhang, Zerui Li, Bo Peng, Lufeng Cheng, Tianfu Fu, Yichuan Wang, Wenhao Chai, Jingbo Shang, Alex Dimakis, Joseph E. Gonzalez, Alvin Cheung
cs.AI

Abstract

Many real-world coding challenges are open-ended and admit no known optimal solution. Yet recent progress in LLM coding has focused on well-defined tasks such as feature implementation, bug fixing, and competitive programming. Open-ended coding remains a weak spot for LLMs, largely because open-ended training problems are scarce and expensive to construct. Our goal is to synthesize open-ended coding problems at scale to train stronger LLM coders. We introduce FrontierSmith, an automated system for iteratively evolving open-ended problems from existing closed-ended coding tasks. Starting from competitive programming problems, FrontierSmith generates candidate open-ended variants by changing the problems' goals, restricting outputs, and generalizing inputs. It then uses a quantitative idea divergence metric to select problems that elicit genuinely diverse approaches from different solvers. Finally, agents generate test cases and verifiers for the surviving candidates. On two open-ended coding benchmarks, training on our synthesized data yields substantial gains over the base models: Qwen3.5-9B improves by +8.82 points on FrontierCS and by +306.36 (Elo-rating-based performance) on ALE-bench; Qwen3.5-27B improves by +12.12 and +309.12, respectively. The synthesized problems also make agents take more turns and use more tokens, similar to human-curated ones, suggesting that closed-ended seeds can be a practical starting point for long-horizon coding data.
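
The abstract does not define the idea divergence metric, so the following is only an illustrative sketch of how a divergence-based filter could work, not the authors' method. It assumes each solver's approach has already been summarized as a set of technique tags, and it scores a candidate problem by the mean pairwise Jaccard distance between those sets. The function names, the tag representation, the example problems, and the 0.5 threshold are all hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def idea_divergence(approaches):
    """Mean pairwise Jaccard distance across solvers' approach-tag sets.

    Hypothetical stand-in for the paper's metric, which the abstract
    does not specify.
    """
    pairs = list(combinations(approaches, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def select_open_ended(candidates, threshold=0.5):
    """Keep the names of candidates whose divergence meets the threshold.

    The threshold value is an assumption chosen for illustration.
    """
    return [
        name
        for name, approaches in candidates.items()
        if idea_divergence(approaches) >= threshold
    ]


# Made-up example data: three solvers per candidate problem.
candidates = {
    "shortest-path (closed-ended)": [
        {"dijkstra"},
        {"dijkstra"},
        {"dijkstra", "heap"},
    ],
    "route-planning variant (open-ended)": [
        {"greedy", "local-search"},
        {"simulated-annealing"},
        {"beam-search", "greedy"},
    ],
}

if __name__ == "__main__":
    for name, approaches in candidates.items():
        print(f"{name}: divergence = {idea_divergence(approaches):.3f}")
    print("selected:", select_open_ended(candidates))
```

In this toy data the closed-ended problem draws nearly identical approaches from every solver and falls below the threshold, while the open-ended variant draws mostly disjoint approaches and is kept. A real pipeline would need some way to extract comparable approach descriptions from solver outputs, which this sketch takes as given.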