

One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling

January 6, 2026
作者: Yiyuan Li, Zhen Huang, Yanan Wu, Weixun Wang, Xuefeng Li, Yijia Luo, Wenbo Su, Bo Zheng, Pengfei Liu
cs.AI

Abstract

The reasoning ability of large language models (LLMs) can be unleashed with reinforcement learning (RL) (OpenAI, 2024; DeepSeek-AI et al., 2025a; Zeng et al., 2025). Existing RL attempts on LLMs typically rely on thousands of high-quality training samples or more. In this paper, we challenge this fundamental assumption about data requirements in RL for LLMs by demonstrating the remarkable effectiveness of one-shot learning. Specifically, we introduce polymath learning, a framework for designing a single training sample that elicits multidisciplinary impact. We present three key findings: (1) a single, strategically selected math reasoning sample, trained with RL, can produce significant performance improvements across multiple domains, including physics, chemistry, and biology; (2) the math skills most salient to reasoning suggest the characteristics of the optimal polymath sample; and (3) an engineered synthetic sample that integrates multidisciplinary elements outperforms training on naturally occurring single-discipline samples. Our approach outperforms training with much larger datasets across various reasoning benchmarks, showing that sample quality and design, rather than quantity, may be the key to unlocking stronger reasoning in language models. These results suggest a shift, which we dub sample engineering, toward the precise design of training samples rather than simply increasing data volume.
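The core idea of training with RL on a single sample can be illustrated with a toy REINFORCE loop. This is a hypothetical sketch, not the paper's actual method or scale: the two-candidate "policy", the 0/1 reward, and the learning rate below are all invented for illustration; the point is only that repeated rollouts on one fixed sample, rewarded for correctness, steadily reshape the policy.

```python
import math
import random

# Toy sketch (not the paper's training setup): REINFORCE on ONE fixed
# training sample. The "policy" is a softmax over two candidate answers
# to a single prompt; reward is 1 for the correct answer, 0 otherwise.
random.seed(0)

CORRECT = 0            # index of the correct answer (hypothetical)
logits = [0.0, 0.0]    # one policy parameter per candidate answer
LR = 0.5               # learning rate (illustrative value)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# Roll out repeatedly on the SAME sample and apply the REINFORCE update:
# d log pi(a) / d logit_i = 1[i == a] - p_i, scaled by the reward.
for _ in range(200):
    probs = softmax(logits)
    a = sample_action(probs)
    reward = 1.0 if a == CORRECT else 0.0
    for i in range(len(logits)):
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += LR * reward * grad

# After training, the policy concentrates on the correct answer.
print(softmax(logits)[CORRECT])
```

Under this (deliberately minimal) setup, only correct rollouts produce a gradient, so the probability of the correct answer climbs toward 1 even though the loop never sees a second sample.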