AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

Abstract

Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can automate parts of this process, but existing approaches typically follow a single research trajectory or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration, to adapt as experimental evidence changes, and to preserve knowledge of failed directions over long-running experiments. We introduce AutoScientists, a decentralized team of AI agents for long-running computational scientific experimentation. Agents interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before spending experimental compute, and share both successes and failures to reduce redundant exploration. Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language-model training optimization, and protein fitness prediction. On BioML-Bench, which spans biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%. On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch and continues to discover improvements from a starting champion model where the single-agent approach finds none (7 vs. 0 accepted improvements). On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the current state-of-the-art model by +12.5% in Spearman correlation. Applied without modification across all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% in Spearman correlation.
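
The coordination pattern described above (a shared experimental state, critique before compute is spent, and a record of failures that prevents repeated work) can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`Proposal`, `SharedState`, `critique`, `run_round`, `toy_experiment`) is hypothetical, the critique rule is a simple stand-in for agent review, the score is assumed to be higher-is-better, and self-organization into teams is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Proposal:
    hypothesis: str   # short description of the idea to test
    est_cost: float   # estimated compute cost, in arbitrary units


@dataclass
class SharedState:
    """Hypothetical shared record that every agent reads and writes."""
    budget: float
    best_score: float
    successes: list = field(default_factory=list)
    failures: set = field(default_factory=set)


def critique(state, proposal):
    """Review a proposal before any compute is spent; returns (approved, reason)."""
    if proposal.hypothesis in state.failures:
        return False, "already failed"
    if proposal.est_cost > state.budget:
        return False, "over budget"
    return True, "ok"


def run_round(state, proposals, run_experiment):
    """Review each proposal, run the approved ones, and record every outcome."""
    log = []
    for p in proposals:
        approved, reason = critique(state, p)
        if not approved:
            log.append((p.hypothesis, reason))
            continue
        state.budget -= p.est_cost
        score = run_experiment(p)
        if score > state.best_score:
            # Accepted improvement: becomes the new champion.
            state.best_score = score
            state.successes.append(p.hypothesis)
            log.append((p.hypothesis, "accepted"))
        else:
            # Failed direction: remembered so no agent retries it.
            state.failures.add(p.hypothesis)
            log.append((p.hypothesis, "rejected"))
    return log


def toy_experiment(p):
    """Stand-in for a real training run; scores are made up for illustration."""
    return {"warmup schedule": 0.62, "larger batch": 0.55}.get(p.hypothesis, 0.0)


state = SharedState(budget=10.0, best_score=0.60)
proposals = [
    Proposal("warmup schedule", 4.0),
    Proposal("larger batch", 4.0),
    Proposal("larger batch", 1.0),   # duplicate of a direction that fails
    Proposal("deeper model", 5.0),   # exceeds the remaining budget
]
log = run_round(state, proposals, toy_experiment)
for entry in log:
    print(entry)
```

In this toy run the second "larger batch" proposal is blocked by the failure record and "deeper model" is blocked by the budget check, so neither consumes compute. The sketch is only meant to show how recording failures alongside successes can cut redundant exploration.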