GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
December 15, 2025
Authors: Tong Wei, Yijun Yang, Changhao Zhang, Junliang Xing, Yuanchun Shi, Zongqing Lu, Deheng Ye
cs.AI
Abstract
Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but they rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR that matches its performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training, and then uses this merged model as a "free" teacher to guide the subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.
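For intuition, the sketch below illustrates the two ingredients the abstract describes: forming a "free" teacher by merging the weights of checkpoints saved during the ongoing RL run, and distilling that teacher's soft logits into the current policy. This is a minimal illustration, not the authors' released code; the uniform weight average, the temperature-scaled KL loss, and all function names are assumptions made for the example.

```python
# Minimal sketch (assumed, not the paper's implementation) of:
#  (1) merging RL checkpoints into a teacher by uniform weight averaging,
#  (2) a standard soft logit (KL) distillation loss from teacher to student.
import copy
import torch
import torch.nn.functional as F


def merge_checkpoints(state_dicts):
    """Average parameters across checkpoints saved during RL training.

    A uniform mean is one possible merging rule; the paper may use a
    different scheme (e.g., moving averages or weighted merging).
    """
    merged = copy.deepcopy(state_dicts[0])
    for key in merged:
        stacked = torch.stack([sd[key].float() for sd in state_dicts], dim=0)
        merged[key] = stacked.mean(dim=0).to(state_dicts[0][key].dtype)
    return merged


def soft_logit_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on temperature-scaled token distributions."""
    t_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # batchmean KL with log-space targets, scaled by T^2 as in standard distillation
    return F.kl_div(s_log_probs, t_log_probs, log_target=True,
                    reduction="batchmean") * temperature ** 2
```

In a training loop, the merged state dict would be loaded into a frozen copy of the policy architecture to serve as the teacher, and the distillation term would be added to (or alternated with) the RL objective; those integration details are schematic here.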