Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

Abstract

Recent advances in agentic test-time scaling allow models to gather environmental feedback before committing to final actions. A key limitation of existing methods is that they typically apply undifferentiated exploration strategies and cannot adaptively determine when exploration is actually required. In this paper, we propose an exploration-aware reinforcement learning framework that enables LLM agents to explore only when uncertainty is high. Our method introduces a fine-grained reward function, derived via variational inference, that explicitly evaluates exploratory actions by estimating their potential to improve future decision-making. It also introduces an exploration-aware grouping mechanism that separates exploratory actions from task-completion actions during optimization. By targeting informational gaps, this design allows agents to explore selectively and to transition to execution as soon as the task context is clear. Empirically, we demonstrate that our approach achieves consistent improvements across a range of challenging text-based and GUI-based agent benchmarks. Code is available at https://github.com/HansenHua/EAPO-ICML26 and models are available at https://huggingface.co/hansenhua/EAPO-ICML26.
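
To make the idea of separating exploratory actions from task-completion actions during optimization more concrete, the following is a minimal sketch, not the released implementation. It assumes that each step carries a label (`"explore"` or `"execute"`) and a scalar reward, and that "grouping" means normalizing rewards into advantages within each label group separately, in the style of group-relative policy optimization. The function name `grouped_advantages`, the labels, and the normalization rule are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from statistics import mean, pstdev

def grouped_advantages(steps, eps=1e-8):
    """Normalize rewards within each action group separately.

    steps: list of (kind, reward) pairs, where kind is a hypothetical label
    such as "explore" or "execute". Returns one advantage per step, in order.
    """
    # Collect rewards per group so the two action types never share a baseline.
    groups = {}
    for kind, reward in steps:
        groups.setdefault(kind, []).append(reward)

    # Per-group mean and population standard deviation.
    stats = {kind: (mean(rs), pstdev(rs)) for kind, rs in groups.items()}

    # Advantage = (reward - group mean) / (group std + eps).
    return [(r - stats[k][0]) / (stats[k][1] + eps) for k, r in steps]


if __name__ == "__main__":
    demo = [("explore", 0.2), ("explore", 0.6), ("execute", 1.0), ("execute", 0.0)]
    print(grouped_advantages(demo))
```

Under this assumed scheme, an exploratory step is compared only against other exploratory steps, so a low-reward but informative exploration is not penalized merely for scoring below task-completion actions.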