The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

Abstract

Rapid advancements in large language models (LLMs) have the potential to assist in scientific progress. A critical capability toward this endeavor is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, we introduce the Automated LLM Speedrunning Benchmark, which leverages the research community's contributions to the NanoGPT speedrun, a competition to train a GPT-2 model in the shortest time. Each of the 19 speedrun tasks provides the agent with the previous record's training script, optionally paired with one of three hint formats, ranging from pseudocode to paper-like descriptions of the new record's improvements. Records execute quickly by design, and speedrun improvements encompass diverse code-level changes, ranging from high-level algorithmic advancements to hardware-aware optimizations. These features make the benchmark both accessible and realistic for the frontier problem of improving LLM training. We find that recent reasoning LLMs combined with state-of-the-art scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints. Our benchmark thus provides a simple, non-saturated measure of an LLM's ability to automate scientific reproduction, a necessary (but not sufficient) skill for an autonomous research agent.
