QueST: Incentivizing LLMs to Generate Difficult Problems
October 20, 2025
Authors: Hanxu Hu, Xingxing Zhang, Jannis Vamvas, Rico Sennrich, Furu Wei
cs.AI
Abstract
Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. However, further scaling is limited by the reliance on human-labeled datasets and the scarcity of large-scale training data for challenging coding problems: existing competitive coding datasets contain only thousands to tens of thousands of problems. Previous synthetic data generation methods rely on either augmenting existing instruction datasets or selecting challenging problems from human-labeled data. In this paper, we propose QueST, a novel framework that combines difficulty-aware graph sampling with difficulty-aware rejection fine-tuning to directly optimize specialized generators for creating challenging coding problems. Our trained generators outperform even GPT-4o at creating challenging problems that benefit downstream performance. We leverage QueST to generate large-scale synthetic coding problems, which we then use either to distill from strong teacher models with long chain-of-thought reasoning or to conduct reinforcement learning for smaller models; the approach proves effective in both scenarios. Our distillation experiments demonstrate significant performance gains. Specifically, after fine-tuning Qwen3-8B-base on 100K difficult problems generated by QueST, the resulting model surpasses the original Qwen3-8B on LiveCodeBench. With an additional 112K examples (i.e., 28K human-written problems paired with multiple synthetic solutions), our 8B model matches the performance of the much larger DeepSeek-R1-671B. These findings indicate that generating complex problems via QueST offers an effective and scalable approach to advancing the frontiers of competitive coding and reasoning for large language models.
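
To make the two ingredients named in the abstract concrete, the sketch below shows a minimal QueST-style loop: sample connected concepts from a concept graph (multi-concept prompts tend to yield harder problems), prompt a generator for a problem, estimate difficulty by a solver's pass rate, and keep only the hard survivors as fine-tuning data for the generator. This is not the authors' implementation; the concept graph, `generate_problem`, `pass_rate`, and the 0.25 pass-rate threshold are all illustrative assumptions.

```python
"""Minimal sketch of a QueST-style generation loop (illustrative, not the
paper's code): difficulty-aware graph sampling + rejection filtering."""
import random

# Hypothetical concept graph: nodes are algorithmic topics, edges connect
# topics that plausibly combine into a single problem.
CONCEPT_GRAPH = {
    "binary search": ["greedy", "graphs"],
    "greedy": ["binary search", "dynamic programming"],
    "dynamic programming": ["greedy", "bitmasks"],
    "graphs": ["binary search", "bitmasks"],
    "bitmasks": ["dynamic programming", "graphs"],
}

def sample_concepts(k: int = 2) -> list[str]:
    """Graph sampling (simplified): random-walk the concept graph so the
    sampled concepts are connected, favoring harder multi-concept prompts."""
    node = random.choice(list(CONCEPT_GRAPH))
    concepts = [node]
    while len(concepts) < k:
        node = random.choice(CONCEPT_GRAPH[node])
        if node not in concepts:
            concepts.append(node)
    return concepts

def generate_problem(concepts: list[str]) -> str:
    """Placeholder for the specialized generator LLM."""
    return f"Write a problem combining: {', '.join(concepts)}"

def pass_rate(problem: str, attempts: int = 8) -> float:
    """Placeholder difficulty probe: fraction of solver attempts that pass
    the problem's tests. Faked here with random outcomes."""
    return sum(random.random() < 0.3 for _ in range(attempts)) / attempts

def collect_hard_problems(n: int, max_pass_rate: float = 0.25) -> list[str]:
    """Difficulty-aware rejection step: discard problems the solver finds
    easy; survivors become fine-tuning data for the generator, and (per the
    abstract) can later be paired with teacher long-CoT solutions to
    distill a student model."""
    kept: list[str] = []
    while len(kept) < n:
        problem = generate_problem(sample_concepts())
        if pass_rate(problem) <= max_pass_rate:
            kept.append(problem)
    return kept

if __name__ == "__main__":
    for p in collect_hard_problems(3):
        print(p)
```

In a real pipeline the placeholders would be LLM calls and sandboxed test execution, but the control flow (sample, generate, probe difficulty, reject easy items) is the part the abstract describes.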