AutoResearch AI: Toward AI-Powered Research Automation for Scientific Discovery

Abstract

Scientific research is being reshaped by AI systems that move beyond isolated assistance toward longer-horizon workflows spanning literature grounding, hypothesis generation, experimentation, validation, reporting, and revision. This shift marks a transition from task-level AI for science to workflow-level research automation. Yet current systems remain fragmented: they differ in autonomy, domain scope, execution environment, validation mechanism, and human oversight, and they still struggle with evidence preservation, reproducibility, weak-direction rejection, provenance tracking, cross-domain robustness, and accountable scientific closure. This survey examines these developments through AutoResearch, defined as the developmental spectrum of AI-powered scientific workflow automation. Within it, Vibe Research denotes the human-steered region of prompt-based assistance and human-verified execution, whereas emerging AI-led systems coordinate larger portions of the discovery loop without yet achieving robust autonomy. We analyze how research systems redistribute control, evidence, execution, validation, and accountability across workflows, and we organize the field around five workflow conditions: literature and research grounding; hypothesis formation and planning; experimentation and tool use; feedback, validation, and review; and reporting and knowledge communication. We further synthesize AI scientist systems, mixed-initiative co-research frameworks, benchmarks, domain deployments, and open-source infrastructures. Finally, we propose five evaluation dimensions (novelty, validity, impact, reliability, and provenance) and show that AutoResearch autonomy is domain-conditioned: it is more credible in structured, executable, and rapidly verifiable settings but limited in embodied, delayed, heterogeneous, ethical, or institutionally accountable contexts.