The Last Human-Written Paper: Agent-Native Research Artifacts

Abstract

Scientific publication compresses a branching, iterative research process into a linear narrative, discarding most of what was discovered along the way. This compilation imposes two structural costs: a Storytelling Tax, where failed experiments, rejected hypotheses, and the branching exploration process are cut to fit a linear narrative; and an Engineering Tax, where the gap between reviewer-sufficient prose and agent-sufficient specification leaves critical implementation details unwritten. Tolerable for human readers, these costs become critical when AI agents must understand, reproduce, and extend published work. We introduce the Agent-Native Research Artifact (ARA), a protocol that replaces the narrative paper with a machine-executable research package structured around four layers: scientific logic, executable code with full specifications, an exploration graph that preserves the failures that compilation discards, and evidence grounding every claim in raw outputs. Three mechanisms support the ecosystem: a Live Research Manager that captures decisions and dead ends during ordinary development; an ARA Compiler that translates legacy PDFs and repos into ARAs; and an ARA-native review system that automates objective checks so human reviewers can focus on significance, novelty, and taste. On PaperBench and RE-Bench, ARA raises question-answering accuracy from 72.4% to 93.7% and reproduction success from 57.4% to 64.4%. On RE-Bench's five open-ended extension tasks, preserved failure traces in ARA accelerate progress but, depending on the agent's capabilities, can also constrain a capable agent from stepping outside the box of prior runs.
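To make the four-layer idea concrete, the following is a minimal illustrative sketch of how such a package might be represented in memory. It is not the ARA schema: the class names (`Claim`, `ExplorationNode`, `ResearchArtifact`), the field names, and the file paths are all hypothetical, chosen only to show claims tied to raw evidence and an exploration graph that keeps abandoned branches.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Claim:
    """One scientific claim plus the raw output files that back it (hypothetical)."""
    statement: str
    evidence_files: List[str] = field(default_factory=list)


@dataclass
class ExplorationNode:
    """One step in the exploration graph; abandoned branches are kept, not deleted."""
    node_id: str
    summary: str
    abandoned: bool = False
    parent_id: Optional[str] = None


@dataclass
class ResearchArtifact:
    """Hypothetical container for the four layers named in the abstract."""
    claims: List[Claim]                  # scientific logic, with evidence per claim
    entrypoints: List[str]               # executable code layer (script paths)
    exploration: List[ExplorationNode]   # exploration graph, failures included

    def ungrounded_claims(self) -> List[Claim]:
        # An objective check a review system could automate:
        # claims that cite no raw output at all.
        return [c for c in self.claims if not c.evidence_files]

    def dead_ends(self) -> List[ExplorationNode]:
        # The failure traces that a narrative paper would normally drop.
        return [n for n in self.exploration if n.abandoned]


# Made-up example data, for illustration only.
artifact = ResearchArtifact(
    claims=[
        Claim("Method A beats the baseline", ["runs/a/metrics.json"]),
        Claim("Method A is robust to seed choice"),
    ],
    entrypoints=["train.py"],
    exploration=[
        ExplorationNode("n0", "baseline"),
        ExplorationNode("n1", "variant that diverged", abandoned=True, parent_id="n0"),
        ExplorationNode("n2", "Method A", parent_id="n0"),
    ],
)

print([c.statement for c in artifact.ungrounded_claims()])
print([n.node_id for n in artifact.dead_ends()])
```

In this sketch the second claim is flagged because it lists no evidence file, and node `n1` is surfaced as a preserved dead end. A real protocol would presumably need richer specifications, such as environments, hyperparameters, and provenance, than this toy structure carries.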
