Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers

Abstract

This paper introduces the first systematic evaluation framework for quantifying the quality and risks of papers written by modern coding agents. Although AI-driven paper writing is a growing concern, rigorous evaluation of the quality and potential risks of AI-written papers remains limited, and a unified understanding of their reliability is still lacking. We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (overview.md) is first created from an existing paper; an agent then generates a full paper from the overview and minimal additional resources, and the result is compared against the original paper. PaperRecon disentangles the evaluation of AI-written papers into two orthogonal dimensions, Presentation and Hallucination: Presentation is evaluated using a rubric, and Hallucination is assessed via agentic evaluation grounded in the original paper source. For evaluation, we introduce PaperWrite-Bench, a benchmark of 51 papers from top-tier venues across diverse domains published after 2025. Our experiments reveal a clear trade-off: while both ClaudeCode and Codex improve with model advances, ClaudeCode achieves higher presentation quality at the cost of more than 10 hallucinations per paper on average, whereas Codex produces fewer hallucinations but lower presentation quality. This work takes a first step toward establishing evaluation frameworks for AI-driven paper writing and improving the understanding of its risks within the research community.
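To make the two-dimensional scoring concrete, the following is a minimal sketch of how per-paper results might be recorded and averaged per agent. It is not the PaperRecon implementation: the names `ReconResult` and `summarize`, the record fields, the score scale, and the agent labels are all hypothetical, and the toy values are placeholders rather than results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class ReconResult:
    """One reconstructed paper scored on the two dimensions (hypothetical schema)."""
    agent: str           # which writing agent produced the reconstructed paper
    paper_id: str        # identifier of the original paper
    presentation: float  # rubric-based score; the scale is an assumption
    hallucinations: int  # number of claims not supported by the original source


def summarize(results):
    """Return {agent: (mean presentation score, mean hallucinations per paper)}."""
    by_agent = defaultdict(list)
    for r in results:
        by_agent[r.agent].append(r)
    return {
        agent: (
            mean(r.presentation for r in rs),
            mean(r.hallucinations for r in rs),
        )
        for agent, rs in by_agent.items()
    }


# Placeholder values for illustration only; these are NOT results from the paper.
toy = [
    ReconResult("agent_a", "p1", 4.0, 12),
    ReconResult("agent_a", "p2", 3.0, 8),
    ReconResult("agent_b", "p1", 2.0, 3),
    ReconResult("agent_b", "p2", 3.0, 1),
]
summary = summarize(toy)
```

Keeping the two averages as separate fields, rather than folding them into one score, mirrors the abstract's point that Presentation and Hallucination are treated as orthogonal dimensions.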
