BioBench: A Blueprint to Move Beyond ImageNet for Scientific ML Benchmarks
November 20, 2025
Author: Samuel Stevens
cs.AI
Abstract
ImageNet-1K linear-probe transfer accuracy remains the default proxy for visual representation quality, yet it no longer predicts performance on scientific imagery. Across 46 modern vision model checkpoints, ImageNet top-1 accuracy explains only 34% of variance on ecology tasks and mis-ranks 30% of models above 75% accuracy. We present BioBench, an open ecology vision benchmark that captures what ImageNet misses. BioBench unifies 9 publicly released, application-driven tasks, 4 taxonomic kingdoms, and 6 acquisition modalities (drone RGB, web video, micrographs, in-situ and specimen photos, camera-trap frames), totaling 3.1M images. A single Python API downloads data, fits lightweight classifiers to frozen backbones, and reports class-balanced macro-F1 (plus domain metrics for FishNet and FungiCLEF); ViT-L models evaluate in 6 hours on an A6000 GPU. BioBench provides new signal for computer vision in ecology and a template recipe for building reliable AI-for-science benchmarks in any domain. Code and predictions are available at https://github.com/samuelstevens/biobench and results at https://samuelstevens.me/biobench.
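The abstract reports class-balanced macro-F1 rather than plain accuracy. As a rough illustration of why that matters on imbalanced ecological data, here is a minimal pure-Python sketch of macro-F1. The function name `macro_f1` and the toy labels are hypothetical and are not taken from the BioBench codebase; the averaging over the union of true and predicted classes is an assumption that follows the common definition of the metric.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper, not the BioBench API).

    Every class contributes equally to the average, no matter how many
    examples it has, so rare classes are not drowned out by common ones.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    # Assumption: average over every class seen in either list.
    classes = set(y_true) | set(y_pred)
    # F1 = 2*TP / (2*TP + FP + FN); the denominator is nonzero for any
    # class that appears at least once in y_true or y_pred.
    per_class = [2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in classes]
    return sum(per_class) / len(per_class)


# Toy example: a model that always predicts the common class.
y_true = ["deer"] * 8 + ["lynx"] * 2
y_pred = ["deer"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_f1={score:.3f}")
```

In this toy case the always-"deer" model reaches 80% accuracy but a much lower macro-F1, because the rare class contributes an F1 of zero to the average.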