How Post-Training Shapes Biological Reasoning Models

Abstract

Scientific reasoning models for biology combine language models with foundation models trained on multimodal biological data, including DNA, RNA, and proteins. These models are built through post-training, yet how each stage shapes reasoning and generalization remains poorly understood. We study when post-training improves performance and when it induces over-specialization. Across genomics, transcriptomics, and proteomics, we train and evaluate more than 100 biological reasoning models under controlled variation in backbone, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), measuring both in-domain (ID) and out-of-domain (OOD) performance. We find that each post-training stage reshapes generalization in a distinct way rather than contributing uniform gains. CPT improves downstream performance by aligning models with biological language. SFT consistently increases ID performance but causes OOD performance to peak early and then decline as models fit the training distribution. RL, when applied to strong SFT checkpoints with aligned rewards, improves OOD performance and partially recovers generalization. These results show that biological reasoning does not improve monotonically with additional supervision or compute. Instead, performance depends on how training stages are composed. Under fixed post-training budgets, the strongest ID-OOD trade-off comes from brief SFT, larger RL allocations, and asymmetric adaptation capacity across stages.