Value-Based Pre-Training with Downstream Feedback
January 29, 2026
Authors: Shuqi Ke, Giulia Fanti
cs.AI
Abstract
Can a small amount of verified goal information steer the expensive self-supervised pretraining of foundation models? Standard pretraining optimizes a fixed proxy objective (e.g., next-token prediction), which can misallocate compute away from downstream capabilities of interest. We introduce V-Pretraining: a value-based, modality-agnostic method for controlled continued pretraining in which a lightweight task designer reshapes the pretraining task to maximize the value of each gradient step. For example, consider self-supervised learning (SSL) with sample augmentation. The V-Pretraining task designer selects pretraining tasks (e.g., augmentations) for which the pretraining loss gradient is aligned with a gradient computed over a downstream task (e.g., image segmentation). This helps steer pretraining towards relevant downstream capabilities. Notably, the pretrained model is never updated on downstream task labels; they are used only to shape the pretraining task. Under matched learner update budgets, V-Pretraining of 0.5B--7B language models improves reasoning (GSM8K test Pass@1) by up to 18% relative over standard next-token prediction, using only 12% of GSM8K training examples as feedback. In vision SSL, we improve on state-of-the-art ADE20K results by up to 1.07 mIoU and reduce NYUv2 RMSE while improving ImageNet linear accuracy, and we provide pilot evidence of improved token efficiency in continued pretraining.
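To make the gradient-alignment idea concrete, the sketch below shows one way a task designer could score candidate augmentations by the cosine similarity between the self-supervised gradient and a downstream-task gradient. This is a minimal illustration assuming a PyTorch-style setup; `flat_grad`, `ssl_loss`, `downstream_loss`, and `candidate_augmentations` are hypothetical names, not the paper's actual interface.

```python
# Minimal sketch (not the authors' implementation) of gradient-alignment
# task selection. `ssl_loss`, `downstream_loss`, and `candidate_augmentations`
# are hypothetical callables supplied by the user.
import torch
import torch.nn.functional as F

def flat_grad(loss, params):
    """Flatten d(loss)/d(params) into a single vector (no update applied)."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def select_pretraining_task(model, unlabeled_batch, candidate_augmentations,
                            ssl_loss, downstream_loss, downstream_batch):
    """Return the augmentation whose self-supervised gradient is most aligned
    (by cosine similarity) with the downstream-task gradient.

    Downstream labels only score candidate tasks here; the learner itself is
    still updated with the self-supervised loss on the chosen task.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    g_down = flat_grad(downstream_loss(model, downstream_batch), params)

    best_aug, best_score = None, -float("inf")
    for aug in candidate_augmentations:
        g_ssl = flat_grad(ssl_loss(model, aug(unlabeled_batch)), params)
        score = F.cosine_similarity(g_ssl, g_down, dim=0).item()
        if score > best_score:
            best_aug, best_score = aug, score
    return best_aug
```

In a continued-pretraining loop, this selection step would run before each learner update, so that every update still counts against the same budget as standard pretraining.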