AI for Automated Research: Roadmap and User Guide

Abstract

AI-assisted research is crossing a threshold: fully automated systems can now generate research papers for as little as $15, while long-horizon agents can execute experiments, draft manuscripts, and simulate critique with minimal human input. Yet this productivity frontier exposes a deeper integrity problem: under scientific pressure, even frontier LLMs still fabricate results, miss hidden errors, and fail to judge novelty reliably. Covering developments through April 2026, we present an end-to-end analysis of AI across the complete research lifecycle, organized into four epistemological phases: Creation (idea generation, literature review, coding & experiments, tables & figures), Writing (paper writing), Validation (peer review, rebuttal & revision), and Dissemination (posters, slides, videos, social media, project pages, and interactive agents). We identify a sharp, stage-dependent boundary between reliable assistance and unreliable autonomy: AI excels at structured, retrieval-grounded, and tool-mediated tasks, but remains fragile for genuinely novel ideas, research-level experiments, and scientific judgment. Generated ideas often degrade after implementation, research code lags far behind pattern-matching benchmarks, and end-to-end autonomous systems have not yet consistently reached major-venue acceptance standards. We further show that greater automation can obscure rather than eliminate failure modes, making human-governed collaboration the most credible deployment paradigm. Finally, we provide a structured taxonomy, a benchmark suite, a tool inventory, cross-stage design principles, and a practitioner-oriented playbook, with resources maintained at our project page.