QueST: Incentivizing LLMs to Generate Difficult Problems

October 20, 2025
Authors: Hanxu Hu, Xingxing Zhang, Jannis Vamvas, Rico Sennrich, Furu Wei
cs.AI

Abstract

Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. However, their scalability is limited by the need for human-labeled datasets and the scarcity of large-scale, challenging coding problems for training: existing competitive-coding datasets contain only thousands to tens of thousands of problems. Previous synthetic data generation methods rely on either augmenting existing instruction datasets or selecting challenging problems from human-labeled data. In this paper, we propose QueST, a novel framework that combines difficulty-aware graph sampling with difficulty-aware rejection fine-tuning to directly optimize specialized generators for creating challenging coding problems. Our trained generators outperform even GPT-4o at creating challenging problems that benefit downstream performance. We leverage QueST to generate large-scale synthetic coding problems, which we then use either to distill long chain-of-thought reasoning from strong teacher models or to conduct reinforcement learning on smaller models, proving effective in both scenarios. Our distillation experiments demonstrate significant performance gains: after fine-tuning Qwen3-8B-base on 100K difficult problems generated by QueST, we surpass the performance of the original Qwen3-8B on LiveCodeBench. With an additional 112K examples (i.e., 28K human-written problems paired with multiple synthetic solutions), our 8B model matches the performance of the much larger DeepSeek-R1-671B. These findings indicate that generating complex problems via QueST offers an effective and scalable approach to advancing the frontiers of competitive coding and reasoning for large language models.
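For intuition, here is a minimal sketch of the pipeline the abstract describes: difficulty-aware graph sampling over a concept graph to seed a problem prompt, followed by difficulty-aware rejection filtering of the generated problems. Everything below is an illustrative assumption rather than the paper's actual implementation: the adjacency-dict concept graph, the use of a reference solver's failure rate as the difficulty proxy, and the `min_difficulty` threshold are all hypothetical stand-ins.

```python
import random

def sample_concept_subgraph(concept_graph, difficulty, k=3):
    """Difficulty-aware graph sampling (hypothetical): pick a seed concept
    with probability proportional to its difficulty score, then grow a
    small subgraph along edges, again biased toward difficult neighbors."""
    concepts = list(concept_graph)
    seed = random.choices(concepts, weights=[difficulty[c] for c in concepts])[0]
    chosen = {seed}
    frontier = [c for c in concept_graph[seed] if c not in chosen]
    while len(chosen) < k and frontier:
        nxt = random.choices(frontier, weights=[difficulty[c] for c in frontier])[0]
        chosen.add(nxt)
        frontier = [c for c in set(frontier) | set(concept_graph[nxt])
                    if c not in chosen]
    return chosen

def estimate_difficulty(problem, solver, n_attempts=8):
    """Hypothetical difficulty proxy: the failure rate of a reference
    solver model over several sampled solution attempts."""
    passes = sum(1 for _ in range(n_attempts) if solver(problem))
    return 1.0 - passes / n_attempts

def build_rejection_finetuning_set(generator, concept_graph, difficulty,
                                   solver, n_candidates=1000,
                                   min_difficulty=0.75):
    """Difficulty-aware rejection filtering: generate candidate problems
    from sampled concept combinations and keep only those the reference
    solver usually fails; survivors become fine-tuning data for the
    problem generator."""
    kept = []
    for _ in range(n_candidates):
        concepts = sample_concept_subgraph(concept_graph, difficulty)
        prompt = ("Write a competition-level coding problem that combines: "
                  + ", ".join(sorted(concepts)))
        problem = generator(prompt)
        if estimate_difficulty(problem, solver) >= min_difficulty:
            kept.append({"prompt": prompt, "problem": problem})
    return kept
```

The point this sketch tries to capture is that difficulty biasing appears twice: once when choosing which concepts to combine into a prompt, and again when deciding which generated problems survive the rejection step to fine-tune the generator.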