GENEB: Why Genomic Models Are Hard to Compare

Abstract

Progress in genomic foundation models is difficult to assess because benchmarks are fragmented, evaluation protocols are incompatible, and results are reported task by task. As a result, claims that one model is superior or more general than another are often not directly comparable. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models on 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and architecture and pretraining-data alignment frequently outweigh parameter count. These results highlight the limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.
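
To make the leaderboard-instability claim concrete, the following is a minimal sketch of how per-category rankings can disagree while an aggregate ranking hides the disagreement. It is not GENEB's evaluation code: the model names, category names, and scores are hypothetical toy values chosen for illustration, and the helpers `ranks_desc` and `spearman` are assumptions introduced here.

```python
import numpy as np

def ranks_desc(scores):
    """Rank models within each column (1 = best; higher score is better).

    Assumes no tied scores within a column.
    """
    order = np.argsort(-scores, axis=0, kind="stable")
    ranks = np.empty_like(order)
    positions = np.arange(scores.shape[0]) + 1
    for j in range(scores.shape[1]):
        ranks[order[:, j], j] = positions
    return ranks

def spearman(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rank vectors."""
    n = len(rank_a)
    d = rank_a - rank_b
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# Hypothetical toy scores (NOT results from the benchmark):
# rows are models, columns are task categories.
models = ["model_a", "model_b", "model_c"]
categories = ["category_1", "category_2"]
scores = np.array([
    [0.90, 0.60],
    [0.80, 0.70],
    [0.70, 0.80],
])

per_category = ranks_desc(scores)        # rank of each model in each category
mean_rank = per_category.mean(axis=1)    # a simple aggregate leaderboard
rho = spearman(per_category[:, 0], per_category[:, 1])

for name, r, m in zip(models, per_category, mean_rank):
    print(f"{name}: per-category ranks={r.tolist()}, mean rank={m:.1f}")
print(f"Spearman correlation between categories: {rho:.2f}")
```

In this toy case the two categories order the models in exactly opposite ways, yet every model receives the same mean rank, so the aggregate view carries no information about which model to choose for a given category.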