GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
December 15, 2025
Authors: Tong Wei, Yijun Yang, Changhao Zhang, Junliang Xing, Yuanchun Shi, Zongqing Lu, Deheng Ye
cs.AI
Abstract
Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but they rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR that matches its performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training, and then uses this merged model as a "free" teacher to guide the subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.
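To make the two core operations concrete, here is a minimal PyTorch-style sketch of (a) merging RL checkpoints into a teacher by uniform weight averaging and (b) distilling the teacher's soft logits into the current policy. The merge rule, temperature, loss weighting, and all function and variable names are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch: merge RL checkpoints into a "free" teacher and distill its soft logits.
# Uniform averaging and the temperature/weighting choices below are assumptions.
import copy
import torch
import torch.nn.functional as F


def merge_checkpoints(state_dicts):
    """Uniformly average parameter tensors from several checkpoints."""
    merged = copy.deepcopy(state_dicts[0])
    for key in merged:
        stacked = torch.stack([sd[key].float() for sd in state_dicts], dim=0)
        merged[key] = stacked.mean(dim=0)
    return merged


def soft_logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's."""
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2


# Hypothetical use inside an RL training loop:
#   teacher = copy.deepcopy(policy)
#   teacher.load_state_dict(merge_checkpoints(recent_checkpoints))
#   teacher.eval()
#   with torch.no_grad():
#       teacher_logits = teacher(observations)
#   distill_loss = soft_logit_distillation_loss(policy(observations), teacher_logits)
#   total_loss = rl_loss + distill_weight * distill_loss
```

In this sketch the teacher costs no extra training or external queries: it is rebuilt from checkpoints the RL run already produces, matching the "free teacher" framing in the abstract.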