QueST: Incentivizing LLMs to Generate Difficult Problems

Abstract

Large language models (LLMs) have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. However, their scalability is limited by the reliance on human-labeled datasets and the lack of large-scale, challenging training data for coding problems. Existing competitive coding datasets contain only thousands to tens of thousands of problems. Previous synthetic data generation methods rely on either augmenting existing instruction datasets or selecting challenging problems from human-labeled data. In this paper, we propose QueST, a novel framework that combines difficulty-aware graph sampling with difficulty-aware rejection fine-tuning to directly optimize specialized generators for creating challenging coding problems. Our trained generators outperform even GPT-4o at creating challenging problems that benefit downstream performance. We use QueST to generate large-scale synthetic coding problems, which we then use either to distill from strong teacher models with long chain-of-thought or to conduct reinforcement learning for smaller models; the problems prove effective in both scenarios. Our distillation experiments demonstrate significant performance gains. Specifically, after fine-tuning Qwen3-8B-base on 100K difficult problems generated by QueST, we surpass the performance of the original Qwen3-8B on LiveCodeBench. With an additional 112K examples (i.e., 28K human-written problems paired with multiple synthetic solutions), our 8B model matches the performance of the much larger DeepSeek-R1-671B. These findings indicate that generating complex problems via QueST offers an effective and scalable approach to advancing the frontiers of competitive coding and reasoning for large language models.
