
SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning

June 10, 2025
作者: Xiao Liang, Zhong-Zhi Li, Yeyun Gong, Yang Wang, Hengyuan Zhang, Yelong Shen, Ying Nian Wu, Weizhu Chen
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for training large language models (LLMs) on complex reasoning tasks, such as mathematical problem solving. A prerequisite for the scalability of RLVR is a high-quality problem set with precise and verifiable answers. However, the scarcity of well-crafted human-labeled math problems and the limited verifiability of answers in existing distillation-oriented synthetic datasets limit their effectiveness in RL. Additionally, most problem synthesis strategies indiscriminately expand the problem set without considering the model's capabilities, leading to low efficiency in generating useful questions. To mitigate this issue, we introduce a Self-aware Weakness-driven problem Synthesis framework (SwS) that systematically identifies model deficiencies and leverages them for problem augmentation. Specifically, we define weaknesses as questions that the model consistently fails to learn through its iterative sampling during RL training. We then extract the core concepts from these failure cases and synthesize new problems to strengthen the model's weak areas in subsequent augmented training, enabling it to focus on and gradually overcome its weaknesses. Without relying on external knowledge distillation, our framework enables robust generalization by empowering the model to self-identify and address its weaknesses in RL, yielding average performance gains of 10.0% and 7.7% on 7B and 32B models across eight mainstream reasoning benchmarks.
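The abstract defines weaknesses as problems the model repeatedly fails to solve across iterative sampling during RL training. The snippet below is a minimal, illustrative Python sketch of how such a pass-rate-based filter could work; the function name, threshold, and data layout are assumptions for illustration only, not the paper's actual implementation.

```python
# Illustrative sketch: flag problems with a consistently low pass rate
# across RL rollouts as candidate "weaknesses" for targeted synthesis.
# Names and the threshold value are hypothetical, not from the paper.

from collections import defaultdict

def identify_weak_problems(rollout_log, pass_rate_threshold=0.125):
    """Return problem ids the model consistently fails during RL sampling.

    rollout_log: iterable of (problem_id, is_correct) pairs accumulated
    over several training steps of repeated sampling.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for problem_id, is_correct in rollout_log:
        attempts[problem_id] += 1
        successes[problem_id] += int(is_correct)

    weak = []
    for problem_id, n in attempts.items():
        if successes[problem_id] / n <= pass_rate_threshold:
            weak.append(problem_id)
    return weak

# Downstream steps described in the abstract (not sketched here):
# extract core concepts from these failure cases and synthesize new
# problems targeting them for subsequent augmented RL training.
```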