GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training

Abstract

Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, densify the reward by querying a teacher that provides step-level feedback, but they rely on costly, often privileged models as the teacher, which limits practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR that matches its performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training, and then uses this merged model as a "free" teacher to guide the subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.
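The two mechanisms named in the abstract, merging checkpoint weights and distilling from the merged model's logits, can be illustrated with a minimal sketch. The abstract does not state the merging rule or the distillation loss, so the uniform weighted average and the KL(teacher || student) objective below are assumptions, and the function names `merge_checkpoints` and `soft_logit_distillation_loss` are hypothetical. Checkpoints are represented as plain dictionaries of float lists to keep the example dependency-free.

```python
import math


def merge_checkpoints(checkpoints, weights=None):
    """Weighted average of parameter dicts (name -> list of floats).

    Assumption: a plain (by default uniform) average; the paper's actual
    merging rule is not given in the abstract.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    merged = {}
    for name, first in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(len(first))
        ]
    return merged


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def soft_logit_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) for a single token position.

    Assumption: forward KL on temperature-scaled logits; other divergences
    would fit the phrase "soft logit distillation" equally well.
    """
    log_p = log_softmax(teacher_logits, temperature)  # merged-checkpoint teacher
    log_q = log_softmax(student_logits, temperature)  # current policy
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Usage: average two toy checkpoints, then score a student against the teacher.
checkpoints = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
teacher_params = merge_checkpoints(checkpoints)
loss = soft_logit_distillation_loss([0.0, 0.0], [2.0, 0.0])
```

In this sketch the loss is zero when student and teacher logits agree and positive otherwise, so it yields a per-step signal without querying any external model.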
