LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context Growth

Abstract

Large language models (LLMs) are increasingly capable of carrying out long-running, real-world tasks. However, as the amount of context grows, their reliability often deteriorates, a phenomenon known as "context rot". Existing long-context benchmarks primarily focus on single-step settings that evaluate a model's ability to retrieve information from a long snippet. In realistic scenarios, however, LLMs often need to act as agents that explore environments, follow instructions and plans, extract useful information, and predict correct actions under a dynamically growing context. To assess language agents in such settings, we introduce LOCA-bench (a benchmark for LOng-Context Agents). Given a task prompt, LOCA-bench leverages automated and scalable control of environment states to regulate the agent's context length. This design enables LOCA-bench to extend the context length potentially to infinity in a controlled way while keeping the underlying task semantics fixed. LOCA-bench evaluates language agents as a combination of models and scaffolds, including various context management strategies. While agent performance generally degrades as the environment states grow more complex, advanced context management techniques can substantially improve the overall success rate. We open-source LOCA-bench to provide a platform for evaluating models and scaffolds in long-context, agentic scenarios: https://github.com/hkust-nlp/LOCA-bench

