HISTAI: An Open-Source, Large-Scale Whole Slide Image Dataset for Computational Pathology
May 17, 2025
Authors: Dmitry Nechaev, Alexey Pchelnikov, Ekaterina Ivanova
cs.AI
Abstract
Recent advancements in Digital Pathology (DP), particularly through
artificial intelligence and Foundation Models, have underscored the importance
of large-scale, diverse, and richly annotated datasets. Despite their critical
role, publicly available Whole Slide Image (WSI) datasets often lack sufficient
scale, tissue diversity, and comprehensive clinical metadata, limiting the
robustness and generalizability of AI models. In response, we introduce the
HISTAI dataset, a large, multimodal, open-access WSI collection comprising over
60,000 slides from various tissue types. Each case in the HISTAI dataset is
accompanied by extensive clinical metadata, including diagnosis, demographic
information, detailed pathological annotations, and standardized diagnostic
coding. The dataset aims to fill gaps identified in existing resources,
promoting innovation, reproducibility, and the development of clinically
relevant computational pathology solutions. The dataset can be accessed at
https://github.com/HistAI/HISTAI.