AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

Abstract

Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can automate parts of this process, but existing approaches typically follow a single research trajectory or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration, to adapt as experimental evidence changes, and to preserve a record of failed directions over long-running experiments. We introduce AutoScientists, a decentralized team of AI agents for long-running computational scientific experimentation. Agents interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before spending experimental compute on them, and share both successes and failures to reduce redundant exploration. Under matched experimental budgets, AutoScientists improves over prior AI agents in biomedical machine learning, language-model training optimization, and protein fitness prediction. On BioML-Bench, which spans biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, an improvement of 8.33% over the strongest prior AI agent. On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch, and it continues to find improvements from a starting champion on which the single-agent approach finds none (7 vs. 0 accepted improvements). On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the current state-of-the-art model by 12.5% in Spearman correlation. Applied without modification to all 217 ProteinGym assays, the same method improves over the prior state of the art by 6.5% in Spearman correlation.
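The abstract reports results in two metrics, Spearman correlation and validation bits-per-byte, without defining them. The sketch below shows the standard textbook definitions of both in plain Python, as an aid to reading the numbers. It is not the authors' evaluation code: the function names (`average_ranks`, `spearman`, `bits_per_byte`) are hypothetical, and it assumes that bits-per-byte is obtained by converting a summed cross-entropy in nats to bits and dividing by the number of bytes in the evaluated text.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("Spearman correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (nats) over a text into bits per byte.

    Assumption: the loss is summed over all predicted tokens and total_bytes
    is the byte length of the same text.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nats / (math.log(2) * total_bytes)
```

Because Spearman correlation depends only on ranks, a fitness predictor scores well whenever it orders variants the same way the assay does, even if its raw outputs are on a different scale. Bits-per-byte normalizes by bytes rather than tokens, so values remain comparable across models that use different tokenizers.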