U-Bench: A Comprehensive Understanding of U-Net through 100-Variant Benchmarking
October 8, 2025
Authors: Fenghe Tang, Chengqi Dong, Wenxin Ma, Zikang Xu, Heqin Zhu, Zihang Jiang, Rongsheng Wang, Yuhao Wang, Chenxu Wu, Shaohua Kevin Zhou
cs.AI
Abstract
Over the past decade, U-Net has been the dominant architecture in medical
image segmentation, leading to the development of thousands of U-shaped
variants. Despite its widespread adoption, there is still no comprehensive
benchmark to systematically evaluate their performance and utility, largely
because of insufficient statistical validation and limited consideration of
efficiency and generalization across diverse datasets. To bridge this gap, we
present U-Bench, the first large-scale, statistically rigorous benchmark that
evaluates 100 U-Net variants across 28 datasets and 10 imaging modalities. Our
contributions are threefold: (1) Comprehensive Evaluation: U-Bench evaluates
models along three key dimensions: statistical robustness, zero-shot
generalization, and computational efficiency. We introduce a novel metric,
U-Score, which jointly captures the performance-efficiency trade-off, offering
a deployment-oriented perspective on model progress. (2) Systematic Analysis
and Model Selection Guidance: We summarize key findings from the large-scale
evaluation and systematically analyze the impact of dataset characteristics and
architectural paradigms on model performance. Based on these insights, we
propose a model advisor agent to guide researchers in selecting the most
suitable models for specific datasets and tasks. (3) Public Availability: We
provide all code, models, protocols, and weights, enabling the community to
reproduce our results and extend the benchmark with future methods. In summary,
U-Bench not only exposes gaps in previous evaluations but also establishes a
foundation for fair, reproducible, and practically relevant benchmarking in the
next decade of U-Net-based segmentation models. The project can be accessed at:
https://fenghetan9.github.io/ubench. Code is available at:
https://github.com/FengheTan9/U-Bench.
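The abstract describes U-Score as jointly capturing the performance-efficiency trade-off but does not give its formula. The sketch below is purely illustrative and is not the paper's actual U-Score definition: it blends a segmentation-quality term (Dice) with an efficiency term that rewards models cheaper than hypothetical reference budgets (`ref_gflops`, `ref_params_m` are assumed names, not from the paper).

```python
def u_score_sketch(dice: float, gflops: float, params_m: float,
                   ref_gflops: float = 100.0, ref_params_m: float = 50.0,
                   alpha: float = 0.5) -> float:
    """Toy performance-efficiency composite (NOT the paper's U-Score).

    dice      -- segmentation accuracy in [0, 1]
    gflops    -- inference cost of the model
    params_m  -- parameter count in millions
    alpha     -- weight on performance vs. efficiency

    The efficiency term saturates at 1.0 for models at or under the
    reference budgets and shrinks proportionally for heavier models.
    """
    efficiency = 0.5 * (min(ref_gflops / gflops, 1.0)
                        + min(ref_params_m / params_m, 1.0))
    return alpha * dice + (1.0 - alpha) * efficiency
```

Under this toy scheme, a model with Dice 0.90 at half the reference compute scores higher than an equally accurate model at four times the budget, which is the general deployment-oriented intuition the abstract attributes to U-Score.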