GENEB: Why Is It Hard to Compare Genomic Models?

Abstract

Progress in genomic foundation models is difficult to assess because benchmarks are fragmented, evaluation protocols are incompatible, and results are reported task by task. As a result, claims that one model is superior to or more general than another are often not directly comparable. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models on 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and architectural and pretraining alignment frequently outweigh parameter count. These results highlight the limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.
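
To make the claim about unstable aggregate leaderboards concrete, the following minimal sketch compares a ranking by mean score with per-category rankings. It is an illustration only and is not GENEB's evaluation code. The model names, category names, scores, and helper functions (`rank_models`, `aggregate_ranking`, `per_category_rankings`, `spearman`) are all hypothetical.

```python
from statistics import mean

# Hypothetical scores (higher is better). These are NOT results from GENEB;
# the model and category names are placeholders chosen for illustration.
scores = {
    "model_a": {"promoter": 0.90, "splicing": 0.60, "chromatin": 0.70},
    "model_b": {"promoter": 0.70, "splicing": 0.88, "chromatin": 0.65},
    "model_c": {"promoter": 0.75, "splicing": 0.70, "chromatin": 0.80},
}


def rank_models(score_by_model):
    """Return model names sorted from best to worst score."""
    return sorted(score_by_model, key=score_by_model.get, reverse=True)


def aggregate_ranking(scores):
    """Rank models by their mean score across all categories."""
    means = {m: mean(list(cats.values())) for m, cats in scores.items()}
    return rank_models(means)


def per_category_rankings(scores):
    """Rank models separately within each category."""
    categories = list(next(iter(scores.values())).keys())
    return {
        c: rank_models({m: cats[c] for m, cats in scores.items()})
        for c in categories
    }


def spearman(rank_x, rank_y):
    """Spearman correlation between two tie-free rankings of the same items."""
    n = len(rank_x)
    pos_y = {m: i for i, m in enumerate(rank_y)}
    d2 = sum((i - pos_y[m]) ** 2 for i, m in enumerate(rank_x))
    return 1 - 6 * d2 / (n * (n * n - 1))


if __name__ == "__main__":
    overall = aggregate_ranking(scores)
    print("aggregate:", overall)
    for category, ranking in per_category_rankings(scores).items():
        print(category, ranking, "rho vs aggregate:", spearman(overall, ranking))
```

With these toy numbers, a different model leads each category, so the single aggregate ordering hides the choice a practitioner focused on one category would actually make. A low or negative rank correlation between the aggregate and a category ranking is one simple signal of this kind of instability.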