The Overlooked Free Lunch of Post-Training: Progress Advantage for LLM Agents

Abstract

Process reward models enable fine-grained, step-level evaluation of LLMs, yet building them for agentic settings remains prohibitively difficult: long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo estimation infeasible at scale. In this work, we show that reinforcement learning (RL) post-training already provides the ingredients for effective step-level scoring, eliminating the need for dedicated reward model training altogether. Concretely, we derive an implicit advantage under a general stochastic Markov decision process, which we term the progress advantage: the log-probability ratio between the RL-trained policy and its reference policy exactly recovers the optimal advantage function. This formulation makes the resulting signal annotation-free, domain-agnostic, and available as a byproduct of the standard RL post-training pipeline. We validate the effectiveness of the progress advantage in three applications (test-time scaling, uncertainty quantification, and failure attribution), spanning five benchmarks and four model families. Across all settings, it consistently outperforms confidence-based baselines and, despite requiring no task-specific training, surpasses dedicated, specially trained reward models. We complement these results with deeper analyses of the characteristics of the progress advantage, offering practical guidance for adoption in real-world agentic systems.
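To make the central quantity concrete, the following is a minimal sketch of how a step-level score could be computed from the log-probability ratio described above. It is an illustration under stated assumptions, not the paper's implementation. The function names (`progress_advantage`, `pick_best_action`), the scale factor `beta` (the KL coefficient in the familiar KL-regularized setting, where the advantage equals `beta` times the log-ratio), and the choice to sum token-level log-probabilities over an action are all hypothetical. The inputs are assumed to be per-token log-probabilities of the same action tokens under the RL-trained policy and under the reference policy.

```python
from typing import List, Sequence


def progress_advantage(
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 1.0,  # assumed scale factor; not specified in the abstract
) -> float:
    """Score one agent step as beta * sum_t [log pi_RL(token_t) - log pi_ref(token_t)].

    Both sequences must hold log-probabilities for the same action tokens,
    given the same context (the interaction history up to this step).
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("policy and reference must score the same tokens")
    # Summing token-level log-ratios gives the log-ratio of the whole action.
    return beta * sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))


def pick_best_action(
    candidates: List[Sequence[float]],
    ref_candidates: List[Sequence[float]],
    beta: float = 1.0,
) -> int:
    """Return the index of the candidate action with the highest score.

    A hypothetical test-time-scaling use: sample several candidate actions
    at one step and keep the one the RL policy prefers most relative to
    the reference policy.
    """
    scores = [
        progress_advantage(p, r, beta)
        for p, r in zip(candidates, ref_candidates)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy log-probabilities (made up for illustration only).
    policy = [[-1.0, -0.5], [-3.0, -2.0]]
    reference = [[-2.0, -1.5], [-2.0, -1.0]]
    print(progress_advantage(policy[0], reference[0]))
    print(pick_best_action(policy, reference))
```

A positive score means the RL-trained policy assigns the action more probability than the reference policy does, and a negative score means less. The same per-step scores could in principle be inspected along a trajectory to flag low-scoring steps, which is the intuition behind the failure attribution use case. The aggregation and thresholds for that are left open here.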