Value-Based Pre-Training with Downstream Feedback
January 29, 2026
Authors: Shuqi Ke, Giulia Fanti
cs.AI
Abstract
Can a small amount of verified goal information steer the expensive self-supervised pretraining of foundation models? Standard pretraining optimizes a fixed proxy objective (e.g., next-token prediction), which can misallocate compute away from downstream capabilities of interest. We introduce V-Pretraining: a value-based, modality-agnostic method for controlled continued pretraining in which a lightweight task designer reshapes the pretraining task to maximize the value of each gradient step. For example, consider self-supervised learning (SSL) with sample augmentation. The V-Pretraining task designer selects pretraining tasks (e.g., augmentations) for which the pretraining loss gradient is aligned with a gradient computed over a downstream task (e.g., image segmentation). This helps steer pretraining towards relevant downstream capabilities. Notably, the pretrained model is never updated on downstream task labels; they are used only to shape the pretraining task. Under matched learner update budgets, V-Pretraining of 0.5B-7B language models improves reasoning (GSM8K test Pass@1) by up to 18% relative over standard next-token prediction using only 12% of GSM8K training examples as feedback. In vision SSL, we improve the state-of-the-art results on ADE20K by up to 1.07 mIoU and reduce NYUv2 RMSE while improving ImageNet linear accuracy, and we provide pilot evidence of improved token efficiency in continued pretraining.
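To make the gradient-alignment idea concrete, here is a minimal sketch (an illustrative assumption, not the authors' method or released code) of how a task designer might score candidate augmentations by the cosine similarity between the pretraining-loss gradient and a downstream-task gradient, assuming a standard PyTorch setup. The names `pretrain_loss`, `downstream_loss`, `candidate_augs`, and the batch arguments are hypothetical placeholders.

```python
# Minimal sketch (assumption, not the paper's implementation) of gradient-aligned
# task selection: pick the augmentation whose pretraining gradient points most in
# the direction of a downstream-task gradient. The downstream labels only score
# candidate tasks; the model is never updated on them.
import torch
import torch.nn.functional as F


def flat_grad(loss, model):
    """Gradient of `loss` w.r.t. all trainable parameters, flattened into one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, allow_unused=True)
    return torch.cat([
        g.reshape(-1) if g is not None else torch.zeros_like(p).reshape(-1)
        for g, p in zip(grads, params)
    ])


def select_augmentation(model, unlabeled_batch, labeled_batch,
                        candidate_augs, pretrain_loss, downstream_loss):
    """Score each candidate augmentation by the cosine similarity between its
    pretraining gradient and the downstream-task gradient, and return the best one."""
    g_down = flat_grad(downstream_loss(model, labeled_batch), model)
    best_aug, best_score = None, float("-inf")
    for aug in candidate_augs:
        g_pre = flat_grad(pretrain_loss(model, aug(unlabeled_batch)), model)
        score = F.cosine_similarity(g_pre, g_down, dim=0).item()
        if score > best_score:
            best_aug, best_score = aug, score
    return best_aug  # use this augmentation for the next pretraining update step
```

The design choice mirrored here is the one stated in the abstract: downstream labels enter only through task selection (choosing which augmentation to train on), while every actual parameter update still comes from the self-supervised pretraining loss.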