Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
June 16, 2026
Authors: Byung-Kwan Lee, Ximing Lu, Shizhe Diao, Minki Kang, Saurav Muralidharan, Karan Sapra, Andrew Tao, Pavlo Molchanov, Yejin Choi, Yu-Chiang Frank Wang, Ryo Hachiuma
cs.AI
Abstract
Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage, so the question is silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches one half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B to 9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off-policy distillation, on-policy distillation, and GRPO, with the largest gains at the smallest scale.
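The mechanics described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no prompt templates, buffer data structures, or sampling schedule, so every name here (`group_advantages`, `build_bcq`, `build_ncq`, `PromptReplayBuffer`), the prompt wording, and the use of population standard deviation for advantage normalization are assumptions made for illustration. The sketch shows three things: why an all-fail rollout group carries zero advantage, how BCQ and NCQ prompts might be assembled, and how a buffer could graduate a question at mean accuracy 0.5 or evict it in FIFO order.

```python
from collections import OrderedDict
from statistics import mean, pstdev
import random


def group_advantages(rewards):
    """Group-normalized advantages, GRPO-style: (r - mean) / std.

    Assumption: population std is used; the paper may normalize differently.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:  # identical rewards, e.g. every rollout failed
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def build_bcq(question, teacher_correct, student_wrong, rng):
    """Hypothetical BCQ template: two shuffled candidates labeled only A/B,
    so the prompt never reveals which one came from the teacher."""
    candidates = [teacher_correct, student_wrong]
    rng.shuffle(candidates)
    return (
        f"{question}\n\nCandidate A:\n{candidates[0]}\n\n"
        f"Candidate B:\n{candidates[1]}\n\n"
        "Exactly one candidate is correct. Decide which, then answer."
    )


def build_ncq(question, student_wrong_rollouts):
    """Hypothetical NCQ template: all wrong student rollouts in one prompt."""
    listed = "\n\n".join(
        f"Incorrect attempt {i}:\n{r}"
        for i, r in enumerate(student_wrong_rollouts, 1)
    )
    return (
        f"{question}\n\n{listed}\n\n"
        "All attempts above are wrong. Find the shared mistake, then answer."
    )


class PromptReplayBuffer:
    """Holds hard questions until they graduate or are FIFO-evicted."""

    def __init__(self, capacity, graduate_at=0.5):
        self.capacity = capacity
        self.graduate_at = graduate_at
        self._items = OrderedDict()  # insertion order doubles as FIFO order

    def add(self, qid, payload):
        if qid in self._items:
            return
        if len(self._items) >= self.capacity:
            self._items.popitem(last=False)  # evict the oldest question
        self._items[qid] = payload

    def update(self, qid, rollout_correct):
        """rollout_correct: list of 1/0 flags. Returns True if graduated."""
        if qid not in self._items:
            return False
        accuracy = sum(rollout_correct) / len(rollout_correct)
        if accuracy >= self.graduate_at:
            del self._items[qid]
            return True
        return False

    def __contains__(self, qid):
        return qid in self._items

    def __len__(self):
        return len(self._items)
```

In this sketch a question whose rollouts all score 0 gets advantages of all zeros from `group_advantages`, which is the case the abstract says is silently discarded; such a question would instead be passed to `PromptReplayBuffer.add` and revisited through its BCQ and NCQ forms. How buffered prompts are mixed into training batches is not specified in the abstract and is left out here.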