SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning

June 10, 2025
Authors: Xiao Liang, Zhong-Zhi Li, Yeyun Gong, Yang Wang, Hengyuan Zhang, Yelong Shen, Ying Nian Wu, Weizhu Chen
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for training large language models (LLMs) on complex reasoning tasks, such as mathematical problem solving. A prerequisite for the scalability of RLVR is a high-quality problem set with precise and verifiable answers. However, the scarcity of well-crafted human-labeled math problems and limited-verification answers in existing distillation-oriented synthetic datasets limit their effectiveness in RL. Additionally, most problem synthesis strategies indiscriminately expand the problem set without considering the model's capabilities, leading to low efficiency in generating useful questions. To mitigate this issue, we introduce a Self-aware Weakness-driven problem Synthesis framework (SwS) that systematically identifies model deficiencies and leverages them for problem augmentation. Specifically, we define weaknesses as questions that the model consistently fails to learn through its iterative sampling during RL training. We then extract the core concepts from these failure cases and synthesize new problems to strengthen the model's weak areas in subsequent augmented training, enabling it to focus on and gradually overcome its weaknesses. Without relying on external knowledge distillation, our framework enables robust generalization by empowering the model to self-identify and address its weaknesses in RL, yielding average performance gains of 10.0% and 7.7% on 7B and 32B models across eight mainstream reasoning benchmarks.
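
To make the weakness-identification step described above concrete, the following is a minimal sketch assuming a simple pass-rate criterion over rollouts sampled during RL training; the function name `identify_weaknesses`, the log format, and the `fail_threshold` value are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
from collections import defaultdict

def identify_weaknesses(sample_log, fail_threshold=0.1):
    """Flag problems the model consistently fails to learn.

    sample_log: iterable of (problem_id, is_correct) pairs gathered while
                sampling rollouts for each training problem during RL.
    Returns problem ids whose empirical pass rate stays below fail_threshold.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for problem_id, is_correct in sample_log:
        attempts[problem_id] += 1
        successes[problem_id] += int(is_correct)
    return [
        pid for pid, n in attempts.items()
        if successes[pid] / n < fail_threshold
    ]

# Toy usage: "p2" is never solved across its rollouts, so it is flagged
# as a weakness whose core concepts would seed new synthesized problems.
log = [("p1", True), ("p1", False),
       ("p2", False), ("p2", False), ("p2", False)]
print(identify_weaknesses(log))  # -> ['p2']
```

In the framework described by the abstract, such flagged failure cases would then have their core concepts extracted and used to synthesize new problems for the subsequent augmented training stage.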