Budget-Aware Tool-Use Enables Effective Agent Scaling

Abstract

Scaling test-time computation improves the performance of large language models (LLMs) across a range of tasks, and the approach has also been extended to tool-augmented agents. For these agents, scaling involves not only "thinking" in tokens but also "acting" via tool calls. The number of tool calls directly bounds the agent's interaction with the external environment. However, we find that simply granting agents a larger tool-call budget fails to improve performance, as they lack "budget awareness" and quickly hit a performance ceiling. To address this, we study how to scale such agents effectively under explicit tool-call budgets, focusing on web search agents. We first introduce the Budget Tracker, a lightweight plug-in that provides the agent with continuous budget awareness, enabling simple yet effective scaling. We further develop BATS (Budget Aware Test-time Scaling), an advanced framework that leverages this awareness to dynamically adapt its planning and verification strategy, deciding whether to "dig deeper" on a promising lead or "pivot" to new paths based on the remaining resources. To analyze cost-performance scaling in a controlled manner, we formalize a unified cost metric that jointly accounts for token and tool consumption. We provide the first systematic study of budget-constrained agents, showing that budget-aware methods produce more favorable scaling curves and push the cost-performance Pareto frontier forward. Our work offers empirical insights toward a more transparent and principled understanding of scaling in tool-augmented agents.
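To make the idea of continuous budget awareness concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that a tracker counts tool calls against a fixed budget and appends a remaining-budget line to each tool result so the agent sees it at every step. All names (`BudgetTracker`, `call_tool_with_budget`, `unified_cost`) and the per-token and per-call prices are hypothetical; the abstract does not give the exact form of the unified cost metric, so a simple weighted sum is assumed here for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BudgetTracker:
    """Hypothetical tracker: counts tool calls against a fixed budget."""
    total_calls: int
    used_calls: int = 0

    @property
    def remaining(self) -> int:
        return self.total_calls - self.used_calls

    def record_call(self) -> None:
        # Refuse the call once the budget is spent.
        if self.remaining <= 0:
            raise RuntimeError("tool-call budget exhausted")
        self.used_calls += 1

    def status_line(self) -> str:
        # Text shown to the agent after every tool call.
        return (f"[budget] used {self.used_calls}/{self.total_calls} "
                f"tool calls, {self.remaining} remaining")


def call_tool_with_budget(tracker: BudgetTracker,
                          tool: Callable[[str], str],
                          query: str) -> str:
    """Run one tool call and append the budget status to its result."""
    tracker.record_call()
    result = tool(query)
    return f"{result}\n{tracker.status_line()}"


def unified_cost(tokens: int, tool_calls: int,
                 price_per_token: float, price_per_call: float) -> float:
    """Assumed cost model: weighted sum of token and tool-call usage."""
    return tokens * price_per_token + tool_calls * price_per_call
```

In this sketch the agent loop would route every search or browse call through `call_tool_with_budget`, so the remaining budget is always part of the observation the model conditions on when choosing whether to continue a lead or try a new one.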
