
One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling

January 6, 2026
作者: Yiyuan Li, Zhen Huang, Yanan Wu, Weixun Wang, Xuefeng Li, Yijia Luo, Wenbo Su, Bo Zheng, Pengfei Liu
cs.AI

Abstract

The reasoning ability of large language models (LLMs) can be unleashed with reinforcement learning (RL) (OpenAI, 2024; DeepSeek-AI et al., 2025a; Zeng et al., 2025). Existing RL attempts in LLMs usually rely on thousands or more high-quality training samples. In this paper, we challenge fundamental assumptions about data requirements in RL for LLMs by demonstrating the remarkable effectiveness of one-shot learning. Specifically, we introduce polymath learning, a framework for designing a single training sample that elicits multidisciplinary impact. We present three key findings: (1) a single, strategically selected math reasoning sample, trained with RL, can produce significant performance improvements across multiple domains, including physics, chemistry, and biology; (2) the math skills most salient to reasoning suggest the characteristics of the optimal polymath sample; and (3) an engineered synthetic sample that integrates multidisciplinary elements outperforms training with naturally occurring single-discipline samples. Our approach outperforms training with larger datasets across various reasoning benchmarks, demonstrating that sample quality and design, rather than quantity, may be the key to unlocking enhanced reasoning capabilities in language models. Our results suggest a paradigm shift, which we dub sample engineering: toward precision engineering of training samples rather than simply increasing data volume.
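The abstract does not specify the RL algorithm, but one-sample RL setups in recent LLM work typically use verifiable rewards and group-relative advantage estimation (as in GRPO-style methods): many rollouts are drawn for the single prompt and each rollout's reward is normalized against the group. The sketch below is illustrative only; the sample, reward function, and group size are hypothetical placeholders, not details from the paper.

```python
# Hedged sketch: GRPO-style group-relative advantages for one-sample RL.
# The single training sample, reward rule, and rollouts below are
# illustrative assumptions, not the paper's actual data or algorithm.

def group_relative_advantages(rewards):
    """Normalize rewards across a group of rollouts for one prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:
        # All rollouts scored the same: no learning signal from this group.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# One training sample: a single math prompt with a verifiable answer.
sample = {"prompt": "Compute 3 * (4 + 5).", "answer": "27"}

# Suppose the policy produced 4 rollouts; reward 1.0 for a correct answer.
completions = ["27", "12", "27", "35"]
rewards = [1.0 if c == sample["answer"] else 0.0 for c in completions]
advantages = group_relative_advantages(rewards)
# Correct rollouts get positive advantage, incorrect ones negative;
# these advantages would then weight the policy-gradient update.
```

Because the reward is verifiable (exact-match on the final answer), the same single prompt can be reused for many update steps, which is what makes the extreme data efficiency described above plausible in practice.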