Zone of Proximal Policy Optimization: Teacher in the Prompt, Not in the Gradient

Abstract

Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage, so the question is silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches one half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B to 9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off-policy and on-policy distillation and GRPO, with the largest gains at the smallest scale.
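The mechanics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function and class names, the prompt wording, and the buffer interface are all hypothetical, and the advantage formula is the group-normalized form commonly associated with GRPO, assumed here rather than taken from the text. Only the graduation threshold of one half and the FIFO eviction rule come from the abstract.

```python
import random
from collections import OrderedDict
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-normalized advantages, (r - mean) / std (assumed GRPO-style form)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        # Every rollout got the same reward (e.g. all failed): no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def build_bcq(question, teacher_correct, student_wrong, rng):
    """BCQ: one correct teacher response and one incorrect student response,
    shown as anonymized candidates in random order (wording is hypothetical)."""
    candidates = [teacher_correct, student_wrong]
    rng.shuffle(candidates)  # hide which candidate came from the teacher
    return (
        f"{question}\n\nCandidate A:\n{candidates[0]}\n\n"
        f"Candidate B:\n{candidates[1]}\n\n"
        "One candidate is correct and one is not. Work out which, then answer."
    )


def build_ncq(question, wrong_rollouts):
    """NCQ: all of the student's wrong rollouts gathered into a single prompt
    (wording is hypothetical)."""
    listed = "\n\n".join(
        f"Incorrect attempt {i}:\n{text}" for i, text in enumerate(wrong_rollouts, 1)
    )
    return (
        f"{question}\n\n{listed}\n\n"
        "All attempts above are wrong. Avoid their mistakes and answer."
    )


class PromptReplayBuffer:
    """Holds hard questions until they graduate or are FIFO-evicted."""

    def __init__(self, capacity, graduate_at=0.5):
        self.capacity = capacity
        self.graduate_at = graduate_at  # "reaches half" in the abstract
        self._items = OrderedDict()  # insertion order gives FIFO order

    def add(self, qid, question):
        if qid in self._items:
            return
        if len(self._items) >= self.capacity:
            self._items.popitem(last=False)  # evict the oldest question
        self._items[qid] = question

    def update(self, qid, rollout_correct):
        """Remove the question once mean rollout accuracy reaches the threshold."""
        if qid not in self._items:
            return
        accuracy = sum(rollout_correct) / len(rollout_correct)
        if accuracy >= self.graduate_at:
            del self._items[qid]

    def pending(self):
        """Question ids still waiting to be replayed, oldest first."""
        return list(self._items)
```

In this sketch, a question whose rollouts all fail produces all-zero advantages from `group_advantages`, which is the case where the question would instead be added to the buffer and replayed in BCQ or NCQ form. How the reformulated prompts are scheduled and rewarded during training is not specified in the abstract and is left out here.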