GENEB: Why Genomic Models Are Hard to Compare

Abstract

Progress in genomic foundation models is difficult to assess because benchmarks are fragmented, evaluation protocols are incompatible, and results are reported task by task. As a result, claims that one model is superior or more general than another often cannot be compared directly. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models on 100 tasks spanning 13 functional categories, under a unified probing-based protocol that includes few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and architectural and pretraining alignment frequently outweigh parameter count. These results highlight the limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.
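
To make the notion of leaderboard instability concrete, the following is a minimal sketch of how per-category rankings could be derived from probing scores and compared with a Spearman rank correlation. It is an illustration under stated assumptions, not the GENEB implementation: the function names, the nested-dictionary layout, the model names `A`, `B`, `C`, the category names `cat_x` and `cat_y`, and all scores are hypothetical toy values.

```python
from statistics import mean


def rank_models(scores):
    """Map each model to its rank (1 = highest score).

    Ties are broken by model name so the output is deterministic.
    """
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {model: i + 1 for i, model in enumerate(ordered)}


def category_rankings(results):
    """Rank models within each category by their mean score over its tasks.

    `results` has the (assumed) layout {category: {task: {model: score}}}.
    """
    rankings = {}
    for category, tasks in results.items():
        models = list(next(iter(tasks.values())).keys())
        avg = {m: mean(task[m] for task in tasks.values()) for m in models}
        rankings[category] = rank_models(avg)
    return rankings


def spearman(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same models."""
    n = len(rank_a)
    d2 = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical toy scores, invented purely for illustration.
toy_results = {
    "cat_x": {
        "task_1": {"A": 0.9, "B": 0.8, "C": 0.7},
        "task_2": {"A": 0.8, "B": 0.7, "C": 0.6},
    },
    "cat_y": {
        "task_3": {"A": 0.6, "B": 0.7, "C": 0.9},
    },
}

ranks = category_rankings(toy_results)
rho = spearman(ranks["cat_x"], ranks["cat_y"])
print(ranks)
print(rho)
```

In this toy case the two categories rank the models in exactly opposite order, so the correlation is -1. A single aggregate ranking averaged over both categories would hide that reversal, which is the kind of trade-off that category-aware reporting is meant to expose.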