Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications

Abstract

Geo-Foundation Models (GFMs) have been evaluated on Earth observation tasks spanning multiple domains and have shown strong potential for producing reliable maps even with sparse labels. However, benchmarking of GFMs for cryosphere applications has remained limited, primarily because suitable evaluation datasets are lacking. To address this gap, we introduce Cryo-Bench, a benchmark compiled to evaluate GFM performance across key cryospheric components. Cryo-Bench covers debris-covered glaciers, glacial lakes, sea ice, and calving fronts, spanning multiple sensors and broad geographic regions. We evaluate 14 GFMs alongside UNet and ViT baselines to assess their advantages, limitations, and optimal usage strategies. With a frozen encoder, UNet achieves the highest average mIoU of 66.38 across the five evaluation datasets included in Cryo-Bench, followed by TerraMind at 64.02. In the few-shot setting (10% of the input data), GFMs such as DOFA and TerraMind outperform UNet, achieving mIoU scores of 59.53 and 56.62, respectively, compared to UNet's 56.60. When GFMs are fully fine-tuned, we observe inconsistent performance across datasets and models. However, tuning the learning rate during fine-tuning substantially improves GFM performance: on two representative datasets (GLID and CaFFe), it yields an average relative improvement of 12.77%. Despite minimal cryosphere representation in their pretraining data, GFMs exhibit notable domain adaptation capabilities and produce meaningful results across tasks. Based on our findings, we recommend encoder fine-tuning with hyperparameter optimization to achieve the best possible performance, and frozen encoders when users need quick results without extensive experimentation. Code is available on GitHub: https://github.com/Sk-2103/Cryo-Bench
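For readers unfamiliar with the two quantities reported above, the following is a minimal sketch of how a mean intersection-over-union (mIoU) score and a relative improvement could be computed. It is illustrative only: the abstract does not state the benchmark's averaging convention (per image or per dataset, or how absent classes are handled), so the choices below are assumptions. The function names `mean_iou` and `relative_improvement` are hypothetical, and the example values are not taken from the paper.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Return mIoU in percent over flattened per-pixel class labels.

    Assumption: classes absent from both prediction and target are skipped
    rather than counted as 0 or 1. The benchmark may use another convention.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both maps (intersection) or in either map (union).
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in pred or target")
    return 100.0 * sum(ious) / len(ious)


def relative_improvement(baseline: float, tuned: float) -> float:
    """Return the relative change from baseline to tuned, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (tuned - baseline) / baseline


# Toy binary segmentation with six pixels (made-up labels).
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)
gain = relative_improvement(baseline=50.0, tuned=56.0)
```

In the toy example each class has an intersection of 2 pixels and a union of 4, so both per-class IoUs are 0.5. A relative improvement divides the change by the baseline score, so it is not the same as a difference in mIoU points.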