PaperBench: Evaluating AI's Ability to Replicate AI Research

April 2, 2025
Authors: Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, Tejal Patwardhan
cs.AI

Abstract

We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.
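
The hierarchical rubric idea can be illustrated with a small sketch. The abstract does not say how sub-task grades are combined, so the following assumes that each leaf is graded pass/fail and that each parent takes the weighted average of its children. The class `RubricNode`, the helpers `score` and `count_leaves`, the weights, and the example sub-tasks are all hypothetical and are not taken from the PaperBench codebase.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node in a hypothetical rubric tree (illustrative, not PaperBench's schema)."""
    name: str
    weight: float = 1.0
    passed: Optional[bool] = None  # only meaningful on leaves
    children: List["RubricNode"] = field(default_factory=list)


def score(node: RubricNode) -> float:
    """Score in [0, 1]. Assumption: leaves are pass/fail, parents average children by weight."""
    if not node.children:
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    return sum(child.weight * score(child) for child in node.children) / total_weight


def count_leaves(node: RubricNode) -> int:
    """Number of individually gradable leaf tasks under this node."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)


# A made-up rubric for one replication attempt.
rubric = RubricNode("replicate paper", children=[
    RubricNode("develop codebase", weight=2.0, children=[
        RubricNode("method implemented", passed=True),
        RubricNode("baseline implemented", passed=False),
    ]),
    RubricNode("execute experiments", weight=1.0, children=[
        RubricNode("training script runs", passed=True),
    ]),
    RubricNode("results match", weight=1.0, children=[
        RubricNode("reported metric reproduced", passed=False),
    ]),
])

print(f"{count_leaves(rubric)} gradable tasks, replication score {score(rubric):.1%}")
```

Under these assumptions, partial progress earns partial credit: an attempt that builds half the codebase and runs the experiments but fails to reproduce the results still receives a nonzero score instead of an all-or-nothing failure.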
