Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

Abstract

Retrieval-Augmented Generation (RAG) has become a standard approach for knowledge-intensive question answering, but existing systems remain brittle on multi-hop questions, where solving the task requires chaining multiple retrieval and reasoning steps. Current methods represent reasoning as free-form natural language, which raises several challenges: intermediate states are implicit, retrieval queries can drift from the intended entities, and errors are detected by the same model that produces them, making self-reflection an unreliable, ungrounded signal. We observe that multi-hop question answering is a typical form of step-by-step computation, and that this structured process aligns closely with how code-specialized language models are trained to operate. Motivated by this observation, we introduce PyRAG, a framework that reformulates multi-hop RAG as program synthesis and execution. Instead of free-form reasoning trajectories, PyRAG represents the reasoning process as an executable Python program over retrieval and QA tools, exposing intermediate states as variables, producing deterministic feedback through execution, and yielding an inspectable trace of the entire reasoning process. This formulation further enables compiler-grounded self-repair and execution-driven adaptive retrieval without any additional training. Experiments on five QA benchmarks (PopQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle) show that PyRAG consistently outperforms strong baselines under both training-free and RL-trained settings, with especially large gains on compositional multi-hop datasets. Our code, data, and models are publicly available at https://github.com/GasolSun36/PyRAG.
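
To make the idea of reasoning as an executable program concrete, the following is a minimal sketch, not the framework's actual API. The `retrieve` and `qa` functions, the toy corpus, the cue table, and the `run_program` helper are all hypothetical stand-ins introduced for illustration; in a real system the tools would be backed by a retriever and a language model, and the program text would be synthesized by a code model. The sketch shows a two-hop question written as Python, with intermediate states held in named variables and execution errors surfaced as deterministic feedback.

```python
# Toy corpus standing in for a real retrieval index (hypothetical data).
CORPUS = [
    "Inception is a 2010 film directed by Christopher Nolan.",
    "Christopher Nolan was born in London.",
    "London is the capital of the United Kingdom.",
]

# Toy cue table used by the stand-in QA tool (an assumption of this sketch).
CUES = {"directed": "directed by ", "born": "born in "}


def retrieve(query, k=1):
    """Hypothetical retrieval tool: rank passages by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda p: -len(q & set(p.lower().rstrip(".").split())),
    )
    return ranked[:k]


def qa(question, passages):
    """Hypothetical QA tool: extract the span following a cue phrase."""
    for word, marker in CUES.items():
        if word in question.lower():
            for p in passages:
                if marker in p:
                    return p.split(marker, 1)[1].rstrip(".")
    raise ValueError(f"no answer found for: {question!r}")


# A two-hop program of the kind a code model might synthesize.
# Each hop stores its result in a variable that the next hop reuses.
PROGRAM = '''
director = qa("Who directed Inception?", retrieve("Inception film director"))
birthplace = qa(f"Where was {director} born?", retrieve(f"{director} born"))
answer = birthplace
'''

# A program with a misspelled tool name, to show execution feedback.
BROKEN_PROGRAM = '''
answer = qa("Who directed Inception?", retreive("Inception film director"))
'''


def run_program(source):
    """Execute a program over the tools and return the answer, trace, or error."""
    env = {"retrieve": retrieve, "qa": qa}
    try:
        exec(source, env)
    except Exception as e:
        # The error message is deterministic feedback that could drive a repair step.
        return {"ok": False, "error": f"{type(e).__name__}: {e}"}
    hidden = {"retrieve", "qa", "__builtins__"}
    trace = {k: v for k, v in env.items() if k not in hidden}
    return {"ok": True, "answer": env.get("answer"), "trace": trace}


result = run_program(PROGRAM)
print(result["answer"], result["trace"])
print(run_program(BROKEN_PROGRAM)["error"])
```

Because every hop writes to a variable, the returned trace records exactly which entity was carried into the second retrieval query, and a failed run reports an interpreter error rather than relying on the model's own judgment of its output.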