The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

Abstract

Rapid advances in large language models (LLMs) have the potential to assist scientific progress. A critical capability toward this goal is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, we introduce the Automated LLM Speedrunning Benchmark, which leverages the research community's contributions to the NanoGPT speedrun, a competition to train a GPT-2 model in the shortest time. Each of the 19 speedrun tasks provides the agent with the previous record's training script, optionally paired with one of three hint formats, ranging from pseudocode to paper-like descriptions of the new record's improvements. Records execute quickly by design, and speedrun improvements encompass diverse code-level changes, ranging from high-level algorithmic advancements to hardware-aware optimizations. These features make the benchmark both accessible and realistic for the frontier problem of improving LLM training. We find that recent reasoning LLMs combined with state-of-the-art scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints. Our benchmark thus provides a simple, non-saturated measure of an LLM's ability to automate scientific reproduction, a necessary (but not sufficient) skill for an autonomous research agent.
