MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos

Abstract

Recent advances in video generation have shown remarkable progress in open-domain settings, yet medical video generation remains largely underexplored. Medical videos are critical for applications such as clinical training, education, and simulation, and they require not only high visual fidelity but also strict medical accuracy. However, current models often produce unrealistic or erroneous content when given medical prompts, largely because large-scale, high-quality datasets tailored to the medical domain are lacking. To address this gap, we introduce MedVideoCap-55K, the first large-scale, diverse, and caption-rich dataset for medical video generation. It comprises over 55,000 curated clips spanning real-world medical scenarios, providing a strong foundation for training generalist medical video generation models. Building on this dataset, we develop MedGen, which achieves leading performance among open-source models and rivals commercial systems across multiple benchmarks in both visual quality and medical accuracy. We hope our dataset and model can serve as a valuable resource and help catalyze further research in medical video generation. Our code and data are available at https://github.com/FreedomIntelligence/MedGen.