The Last Human-Written Paper: Agent-Native Research Artifacts

Abstract

Scientific publication compresses a branching, iterative research process into a linear narrative, discarding most of what was discovered along the way. This compilation imposes two structural costs: a Storytelling Tax, where failed experiments, rejected hypotheses, and the branching exploration process are discarded to fit a linear narrative; and an Engineering Tax, where the gap between reviewer-sufficient prose and agent-sufficient specification leaves critical implementation details unwritten. Tolerable for human readers, these costs become critical when AI agents must understand, reproduce, and extend published work. We introduce the Agent-Native Research Artifact (ARA), a protocol that replaces the narrative paper with a machine-executable research package structured around four layers: scientific logic, executable code with full specifications, an exploration graph that preserves the failures compilation discards, and evidence grounding every claim in raw outputs. Three mechanisms support the ecosystem: a Live Research Manager that captures decisions and dead ends during ordinary development; an ARA Compiler that translates legacy PDFs and repos into ARAs; and an ARA-native review system that automates objective checks so human reviewers can focus on significance, novelty, and taste. On PaperBench and RE-Bench, ARA raises question-answering accuracy from 72.4% to 93.7% and reproduction success from 57.4% to 64.4%. On RE-Bench's five open-ended extension tasks, the failure traces preserved in ARA accelerate progress but, depending on the agent's capabilities, can also keep a capable agent from stepping outside the box defined by prior runs.
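
To make the four-layer structure concrete, the following is a minimal sketch of how such a package might be represented in memory. It is not the ARA schema: the abstract does not specify one, so every class name, field name, outcome label, and file path here is a hypothetical chosen for illustration. The sketch only shows the idea that each claim should be traceable to raw outputs and that failed directions are kept rather than discarded.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Claim:
    """Scientific-logic layer: one claim the work makes (hypothetical shape)."""
    claim_id: str
    statement: str


@dataclass
class ExplorationNode:
    """Exploration-graph layer: one attempted direction, dead ends included."""
    node_id: str
    description: str
    outcome: str  # hypothetical labels: "kept", "failed", "rejected"
    parent_id: Optional[str] = None


@dataclass
class Evidence:
    """Evidence layer: ties a claim to a raw output file."""
    claim_id: str
    raw_output_path: str


@dataclass
class ResearchArtifact:
    """Illustrative container for the four layers named in the abstract."""
    claims: list[Claim] = field(default_factory=list)
    code_entrypoints: list[str] = field(default_factory=list)  # executable-code layer
    exploration: list[ExplorationNode] = field(default_factory=list)
    evidence: list[Evidence] = field(default_factory=list)

    def ungrounded_claims(self) -> list[str]:
        """Return IDs of claims that have no evidence entry."""
        grounded = {e.claim_id for e in self.evidence}
        return [c.claim_id for c in self.claims if c.claim_id not in grounded]

    def dead_ends(self) -> list[ExplorationNode]:
        """Return exploration nodes that did not make it into the final result."""
        return [n for n in self.exploration if n.outcome != "kept"]


# Placeholder content only; none of these strings come from the paper.
example = ResearchArtifact(
    claims=[Claim("c1", "placeholder claim A"), Claim("c2", "placeholder claim B")],
    code_entrypoints=["run_experiment.py"],
    exploration=[
        ExplorationNode("n1", "first approach", "failed"),
        ExplorationNode("n2", "revised approach", "kept", parent_id="n1"),
    ],
    evidence=[Evidence("c1", "outputs/c1_metrics.json")],
)

print(example.ungrounded_claims())
print([n.node_id for n in example.dead_ends()])
```

A check like `ungrounded_claims` is the kind of objective test an automated review step could run before a human reviewer is involved, under the assumption that claims and evidence share identifiers as they do in this sketch.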
