Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications
March 2, 2026
Authors: Saurabh Kaushik, Lalit Maurya, Beth Tellman
cs.AI
Abstract
Geo-Foundation Models (GFMs) have been evaluated across diverse Earth observation tasks spanning multiple domains and have demonstrated strong potential for producing reliable maps even with sparse labels. However, benchmarking of GFMs for cryosphere applications has remained limited, primarily due to the lack of suitable evaluation datasets. To address this gap, we introduce Cryo-Bench, a benchmark compiled to evaluate GFM performance across key cryospheric components. Cryo-Bench includes debris-covered glaciers, glacial lakes, sea ice, and calving fronts, spanning multiple sensors and broad geographic regions. We evaluate 14 GFMs alongside UNet and ViT baselines to assess their advantages, limitations, and optimal usage strategies. With a frozen encoder, UNet achieves the highest average mIoU of 66.38, followed by TerraMind at 64.02, across the five evaluation datasets included in Cryo-Bench. In the few-shot setting (10% of the input data), GFMs such as DOFA and TerraMind outperform UNet, achieving mIoU scores of 59.53 and 56.62, respectively, compared to UNet's 56.60. When fully fine-tuning GFMs, we observe inconsistent performance across datasets and models. However, tuning the learning rate along with fine-tuning substantially improves GFM performance. For example, evaluation on two representative datasets (GLID and CaFFe) shows an average relative improvement of 12.77%. Despite having minimal cryosphere representation in their pretraining data, GFMs exhibit notable domain adaptation capabilities and produce meaningful results across tasks. Based on our findings, we recommend encoder fine-tuning with hyperparameter optimization to achieve the best possible performance, and frozen encoders when users need quick results without extensive experimentation. (GitHub: https://github.com/Sk-2103/Cryo-Bench)
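For readers unfamiliar with the two quantities reported in the abstract, the following is a minimal sketch of how a mean intersection-over-union (mIoU) score and a relative improvement are commonly computed. It is not taken from the Cryo-Bench code: the function names `mean_iou` and `relative_improvement` are hypothetical, and the choice to average only over classes present in the prediction or the reference is an assumption that may differ from the authors' implementation.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over flattened per-pixel class labels.

    Assumption: classes absent from both pred and target are skipped
    rather than counted as IoU = 0 or IoU = 1.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and reference are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or reference is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)

    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Toy example with two classes (0 = background, 1 = feature of interest).
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
score = mean_iou(pred, target, num_classes=2)  # fraction in [0, 1]
print(f"mIoU = {100 * score:.2f}")  # scaled to 0-100, the scale used in the abstract
```

Under these assumptions, a relative improvement such as the 12.77% quoted above would be the percentage change in mIoU between two training configurations, averaged over the datasets considered.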