The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

June 27, 2025
作者: Bingchen Zhao, Despoina Magka, Minqi Jiang, Xian Li, Roberta Raileanu, Tatiana Shavrina, Jean-Christophe Gagnon-Audet, Kelvin Niu, Shagun Sodhani, Michael Shvartsman, Andrei Lupu, Alisia Lupidi, Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Thomas Foster, Lucia Cipolina-Kun, Abhishek Charnalia, Derek Dunfield, Alexander H. Miller, Oisin Mac Aodha, Jakob Foerster, Yoram Bachrach
cs.AI

Abstract

Rapid advancements in large language models (LLMs) have the potential to assist in scientific progress. A critical capability toward this goal is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, we introduce the Automated LLM Speedrunning Benchmark, leveraging the research community's contributions to the NanoGPT speedrun, a competition to train a GPT-2 model in the shortest time. Each of the 19 speedrun tasks provides the agent with the previous record's training script, optionally paired with one of three hint formats, ranging from pseudocode to paper-like descriptions of the new record's improvements. Records execute quickly by design, and speedrun improvements encompass diverse code-level changes, ranging from high-level algorithmic advancements to hardware-aware optimizations. These features make the benchmark both accessible and realistic for the frontier problem of improving LLM training. We find that recent reasoning LLMs combined with state-of-the-art scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints. Our benchmark thus provides a simple, non-saturated measure of an LLM's ability to automate scientific reproduction, a necessary (but not sufficient) skill for an autonomous research agent.
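To make the task structure concrete, below is a minimal Python sketch of how one speedrun task might be represented and scored. The names (`SpeedrunTask`, `HintFormat`, `speedup_recovered`), the specific hint labels, and the fraction-of-speedup scoring rule are illustrative assumptions based on the abstract's description, not the benchmark's actual interface.

```python
from dataclasses import dataclass
from enum import Enum

class HintFormat(Enum):
    # The abstract mentions three hint formats "ranging from pseudocode to
    # paper-like descriptions"; these labels are assumptions for illustration.
    NONE = "none"              # only the previous record's training script
    PSEUDOCODE = "pseudocode"  # pseudocode of the record's change
    TEXT = "text"              # natural-language description of the change
    PAPER = "paper"            # paper-like write-up of the improvement

@dataclass
class SpeedrunTask:
    record_index: int        # which of the 19 record-to-record steps this targets
    prev_script: str         # path to the previous record's training script
    hint: HintFormat         # optional hint accompanying the script
    baseline_seconds: float  # training time of the previous record
    target_seconds: float    # training time achieved by the new record

def speedup_recovered(task: SpeedrunTask, agent_seconds: float) -> float:
    """Hypothetical score: fraction of the known record-to-record speedup
    the agent's modified script recovered. 1.0 means the agent matched the
    next record's training time; 0.0 means no improvement over baseline."""
    known_gain = task.baseline_seconds - task.target_seconds
    agent_gain = task.baseline_seconds - agent_seconds
    if known_gain <= 0:
        return 0.0
    return max(0.0, min(1.0, agent_gain / known_gain))
```

Under this sketch, an agent would receive `prev_script` (plus the hint, if any), produce a modified training script, and be judged by how much of the known speedup its script recovers when run.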