MedINST: Meta Dataset of Biomedical Instructions
October 17, 2024
Authors: Wenhan Han, Meng Fang, Zihan Zhang, Yu Yin, Zirui Song, Ling Chen, Mykola Pechenizkiy, Qingyu Chen
cs.AI
Abstract
The integration of large language model (LLM) techniques in the field of
medical analysis has brought about significant advancements, yet the scarcity
of large, diverse, and well-annotated datasets remains a major challenge.
Medical data and tasks, which vary in format, size, and other parameters,
require extensive preprocessing and standardization for effective use in
training LLMs. To address these challenges, we introduce MedINST, the Meta
Dataset of Biomedical Instructions, a novel multi-domain, multi-task
instructional meta-dataset. MedINST comprises 133 biomedical NLP tasks and over
7 million training samples, making it the most comprehensive biomedical
instruction dataset to date. Using MedINST as the meta-dataset, we curate
MedINST32, a challenging benchmark of tasks with varying difficulty, designed to
evaluate LLMs' generalization ability. We fine-tune several LLMs on MedINST and
evaluate on MedINST32, showcasing enhanced cross-task generalization.