Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervision Signals Fix It

Abstract

Tool use enables large language models (LLMs) to perform complex tasks, and recent agentic reinforcement learning (RL) methods show promise for enhancing model capabilities. However, RL alone often leads to instability or limited gains in tool-use tasks. In our experiments, some models exhibit catastrophic collapse, in which performance drops abruptly and tool-invocation structures fail. Our analysis reveals that these failures stem from unexpected probability spikes in specific control tokens, which disrupt structured execution; the underlying tool-use capability remains intact and is merely obscured by specific formats. To address this, we systematically investigate a diverse set of supervisory signals, including off-policy supervision, hint-based guidance, erroneous-example supervision, and others, applied under both synchronous and interleaved training schemes. We find that interleaving supervised fine-tuning (SFT) with RL substantially improves stability but degrades performance under format and content out-of-distribution (OOD) evaluation. We also analyze the impact of learning rates and generalization across settings. These results highlight the importance of understanding RL failures and demonstrate how diverse supervisory signals can guide exploratory learning, enabling robust training of LLMs for complex, multi-step tool-use tasks. Our code is available at https://github.com/hypasd-art/Tool-RL-Box.
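To make the interleaved training scheme concrete, the following is a minimal sketch of how SFT steps might be alternated with RL steps. It is an illustration under stated assumptions, not the schedule used in the paper: the function names `interleaved_schedule` and `count_phases`, the parameter `rl_per_sft`, and the fixed ratio of RL to SFT steps are all hypothetical, and the actual RL and SFT updates are left as a placeholder.

```python
from typing import Dict, Iterator


def interleaved_schedule(total_steps: int, rl_per_sft: int) -> Iterator[str]:
    """Yield "rl" or "sft" for each training step.

    Hypothetical schedule: after every `rl_per_sft` RL steps, one SFT step
    is inserted. The ratio is an illustrative assumption, not a value taken
    from the paper.
    """
    if rl_per_sft < 1:
        raise ValueError("rl_per_sft must be at least 1")
    for step in range(total_steps):
        # Every (rl_per_sft + 1)-th step is an SFT step; all others are RL steps.
        yield "sft" if (step + 1) % (rl_per_sft + 1) == 0 else "rl"


def count_phases(total_steps: int, rl_per_sft: int) -> Dict[str, int]:
    """Walk the schedule and count how many steps of each kind it contains."""
    counts = {"rl": 0, "sft": 0}
    for phase in interleaved_schedule(total_steps, rl_per_sft):
        # Placeholder: a real training loop would run an RL policy update or
        # an SFT update on supervised data here, depending on `phase`.
        counts[phase] += 1
    return counts


if __name__ == "__main__":
    print(list(interleaved_schedule(6, 2)))
    print(count_phases(6, 2))
```

A synchronous scheme would instead combine both signals within the same update step; the sketch covers only the interleaved case.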