AI for Auto-Research: Roadmap & User Guide
May 18, 2026
Authors: Lingdong Kong, Xian Sun, Wei Chow, Linfeng Li, Kevin Qinghong Lin, Xuan Billy Zhang, Song Wang, Rong Li, Qing Wu, Wei Gao, Yingshuo Wang, Shaoyuan Xie, Jiachen Liu, Leigang Qu, Shijie Li, Lai Xing Ng, Benoit R. Cottereau, Ziwei Liu, Tat-Seng Chua, Wei Tsang Ooi
cs.AI
Abstract
AI-assisted research is crossing a threshold: fully automated systems can now generate research papers for as little as $15, while long-horizon agents can execute experiments, draft manuscripts, and simulate critique with minimal human input. Yet this productivity frontier exposes a deeper integrity problem: under scientific pressure, even frontier LLMs still fabricate results, miss hidden errors, and fail to judge novelty reliably. Covering developments through April 2026, we present an end-to-end analysis of AI across the complete research lifecycle, organized into four epistemological phases: Creation (idea generation, literature review, coding & experiments, tables & figures), Writing (paper writing), Validation (peer review, rebuttal & revision), and Dissemination (posters, slides, videos, social media, project pages, and interactive agents). We identify a sharp, stage-dependent boundary between reliable assistance and unreliable autonomy: AI excels at structured, retrieval-grounded, and tool-mediated tasks, but remains fragile for genuinely novel ideas, research-level experiments, and scientific judgment. Generated ideas often degrade after implementation, research code lags far behind pattern-matching benchmarks, and end-to-end autonomous systems have not yet consistently reached major-venue acceptance standards. We further show that greater automation can obscure rather than eliminate failure modes, making human-governed collaboration the most credible deployment paradigm. Finally, we provide a structured taxonomy, a benchmark suite and tool inventory, cross-stage design principles, and a practitioner-oriented playbook, with resources maintained at our project page.