AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

Abstract

Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can automate parts of this process, but existing approaches typically follow a single research trajectory or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration, adapt as experimental evidence changes, or preserve knowledge of failed directions over long-running experiments. We introduce AutoScientists, a decentralized team of AI agents for long-running computational scientific experimentation. Agents interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration. Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language-model training optimization, and protein fitness prediction. On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%. On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch and continues discovering improvements from a starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements). On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the current state-of-the-art model by +12.5% in Spearman correlation. Applied without modification across all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% (Spearman correlation).
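
The ProteinGym results above are reported as Spearman correlation, a rank-based measure of agreement between predicted and measured fitness. As an illustration of the metric only, the following minimal sketch computes it as the Pearson correlation of average ranks. The function names `average_ranks` and `spearman` are hypothetical and chosen for this example; this is not the evaluation code used by AutoScientists or ProteinGym.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(predicted), average_ranks(measured)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

Because only the ordering matters, a predictor that ranks variants correctly scores 1.0 even when its raw scores are on a different scale from the measured fitness values.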