HISTAI: An Open-Source, Large-Scale Whole Slide Image Dataset for Computational Pathology
May 17, 2025
Authors: Dmitry Nechaev, Alexey Pchelnikov, Ekaterina Ivanova
cs.AI
Abstract
Recent advancements in Digital Pathology (DP), particularly through
artificial intelligence and Foundation Models, have underscored the importance
of large-scale, diverse, and richly annotated datasets. Despite their critical
role, publicly available Whole Slide Image (WSI) datasets often lack sufficient
scale, tissue diversity, and comprehensive clinical metadata, limiting the
robustness and generalizability of AI models. In response, we introduce the
HISTAI dataset, a large, multimodal, open-access WSI collection comprising over
60,000 slides from various tissue types. Each case in the HISTAI dataset is
accompanied by extensive clinical metadata, including diagnosis, demographic
information, detailed pathological annotations, and standardized diagnostic
coding. The dataset aims to fill gaps identified in existing resources,
promoting innovation, reproducibility, and the development of clinically
relevant computational pathology solutions. The dataset can be accessed at
https://github.com/HistAI/HISTAI.