How Post-Training Shapes Biological Reasoning Models

Abstract

Scientific reasoning models for biology combine language models with foundation models trained on multimodal biological data, including DNA, RNA, and proteins. These models are built through post-training, yet how each stage shapes reasoning and generalization remains poorly understood. We study when post-training improves performance and when it induces over-specialization. Across genomics, transcriptomics, and proteins, we train and evaluate more than 100 biological reasoning models under controlled variation in backbone, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), measuring both in-domain (ID) and out-of-domain (OOD) performance. We find that each post-training stage reshapes generalization in a distinct way rather than contributing uniform gains. CPT improves downstream performance by aligning models with biological language. SFT consistently increases ID performance, but OOD performance peaks early and then declines as models fit the training distribution. RL, when applied to strong SFT checkpoints with aligned rewards, improves OOD performance and partially recovers generalization. These results show that biological reasoning does not improve monotonically with additional supervision or compute. Instead, performance depends on how training stages are composed. Under fixed post-training budgets, the strongest ID-OOD trade-off comes from brief SFT, larger RL allocations, and asymmetric adaptation capacity across stages.