GENEB: Why Genomic Models Are Hard to Compare

Abstract

Progress in genomic foundation models is difficult to assess because benchmarks are fragmented, evaluation protocols are incompatible, and results are reported task by task. As a result, claims that one model is superior to or more general than another often cannot be compared directly. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models on 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and the alignment between architecture and pretraining frequently outweighs parameter count. These results highlight the limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.
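
The claim that aggregate leaderboards are unstable can be made concrete by comparing per-category model rankings. The sketch below is only an illustration under stated assumptions, not GENEB's actual analysis code: the model names, category names, and scores are made-up toy values, and Kendall's tau is one reasonable choice of rank-agreement measure that the abstract does not name.

```python
from itertools import combinations


def rank_models(scores):
    """Return model names ordered best-first by score (ties broken by name)."""
    return sorted(scores, key=lambda m: (-scores[m], m))


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score dicts over the same models.

    +1 means identical ordering, -1 means fully reversed ordering.
    No tie correction is applied; tied pairs count as neither
    concordant nor discordant.
    """
    models = sorted(scores_a)
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical toy scores, NOT results from GENEB: three placeholder models
# evaluated on two placeholder task categories.
toy_scores = {
    "category_A": {"model_small": 0.70, "model_medium": 0.75, "model_large": 0.80},
    "category_B": {"model_small": 0.66, "model_medium": 0.60, "model_large": 0.62},
}

rank_a = rank_models(toy_scores["category_A"])
rank_b = rank_models(toy_scores["category_B"])
tau = kendall_tau(toy_scores["category_A"], toy_scores["category_B"])

print("category_A ranking:", rank_a)
print("category_B ranking:", rank_b)
print("Kendall tau between categories:", round(tau, 3))
```

In this toy setup the largest model leads one category while the smallest leads the other, so the two rankings disagree on most model pairs and the tau value is negative. A single averaged score would hide that disagreement, which is the kind of effect a category-aware comparison is meant to surface.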