The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

June 27, 2025
作者: Bingchen Zhao, Despoina Magka, Minqi Jiang, Xian Li, Roberta Raileanu, Tatiana Shavrina, Jean-Christophe Gagnon-Audet, Kelvin Niu, Shagun Sodhani, Michael Shvartsman, Andrei Lupu, Alisia Lupidi, Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Thomas Foster, Lucia Cipolina-Kun, Abhishek Charnalia, Derek Dunfield, Alexander H. Miller, Oisin Mac Aodha, Jakob Foerster, Yoram Bachrach
cs.AI

Abstract

Rapid advancements in large language models (LLMs) have the potential to assist in scientific progress. A critical capability toward this endeavor is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, we introduce the Automated LLM Speedrunning Benchmark, leveraging the research community's contributions to the NanoGPT speedrun, a competition to train a GPT-2 model in the shortest time. Each of the 19 speedrun tasks provides the agent with the previous record's training script, optionally paired with one of three hint formats, ranging from pseudocode to paper-like descriptions of the new record's improvements. Records execute quickly by design, and speedrun improvements encompass diverse code-level changes, ranging from high-level algorithmic advancements to hardware-aware optimizations. These features make the benchmark both accessible and realistic for the frontier problem of improving LLM training. We find that recent reasoning LLMs combined with state-of-the-art scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints. Our benchmark thus provides a simple, non-saturated measure of an LLM's ability to automate scientific reproduction, a necessary (but not sufficient) skill for an autonomous research agent.
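To make the task structure concrete, the sketch below shows one way a benchmark task could be represented, based only on the description in the abstract. Everything here is hypothetical: the names (`SpeedrunTask`, `HintFormat`, `speedup_recovered`) and the scoring function are illustrative assumptions, not the paper's actual harness or metric.

```python
# Hypothetical sketch of one speedrun task, based only on the abstract's
# description; all names and fields are illustrative, not the paper's API.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HintFormat(Enum):
    NONE = "none"                # agent sees only the previous record's script
    PSEUDOCODE = "pseudocode"    # pseudocode of the new record's change
    TEXT = "text"                # natural-language description of the change
    PAPER = "paper"              # paper-like write-up of the improvement


@dataclass
class SpeedrunTask:
    """One of the 19 record-to-record transitions in the benchmark (assumed layout)."""
    record_index: int                  # which NanoGPT speedrun record to reproduce
    prev_record_script: str            # path to the previous record's training script
    hint: Optional[str] = None         # hint content, interpreted per hint_format
    hint_format: HintFormat = HintFormat.NONE
    target_train_seconds: float = 0.0  # wall-clock time of the human record


def speedup_recovered(task: SpeedrunTask, agent_train_seconds: float,
                      prev_train_seconds: float) -> float:
    """Fraction of the human record's speedup recovered by the agent.

    1.0 means the agent matched the record; 0.0 means no improvement over
    the previous record. (A hypothetical metric, for illustration only.)
    """
    human_gain = prev_train_seconds - task.target_train_seconds
    agent_gain = prev_train_seconds - agent_train_seconds
    if human_gain <= 0:
        return 0.0
    return max(0.0, min(1.0, agent_gain / human_gain))


# Example: the human record saved 50 s over the previous record; the
# agent's reimplementation saves 30 s, recovering 0.6 of the speedup.
task = SpeedrunTask(record_index=3, prev_record_script="record_02.py",
                    hint_format=HintFormat.PSEUDOCODE,
                    target_train_seconds=250.0)
print(speedup_recovered(task, agent_train_seconds=270.0,
                        prev_train_seconds=300.0))  # -> 0.6
```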