DanQing (丹青): An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset
January 15, 2026
Authors: Hengyu Shen, Tiancheng Gu, Bin Qin, Lan Wu, Yuling Wu, Shuo Tan, Zelong Sun, Jun Wang, Nan Wu, Xiang An, Weidong Cai, Ziyong Feng, Kaicheng Yang
cs.AI
Abstract
Vision-Language Pre-training (VLP) models demonstrate strong performance across various downstream tasks by learning from large-scale image-text pairs through contrastive pre-training. The release of extensive English image-text datasets (e.g., COYO-700M and LAION-400M) has enabled widespread adoption of models such as CLIP and SigLIP in tasks including cross-modal retrieval and image captioning. However, the advancement of Chinese vision-language pre-training has lagged substantially behind due to the scarcity of high-quality Chinese image-text data. To address this gap, we develop a comprehensive pipeline for constructing high-quality Chinese cross-modal datasets and use it to build DanQing, a dataset of 100 million image-text pairs collected from Common Crawl. Unlike existing datasets, DanQing is curated through a more rigorous selection process, yielding superior data quality. Moreover, DanQing is built primarily from 2024-2025 web data, enabling models to better capture evolving semantic trends and thus offering greater practical utility. We compare DanQing with existing datasets by continually pre-training the SigLIP2 model and find that DanQing consistently yields superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations. To facilitate further research in Chinese vision-language pre-training, we will open-source the DanQing dataset under the Creative Commons CC-BY 4.0 license.
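As context for the continual pre-training setup named in the abstract, the sketch below shows the pairwise sigmoid loss that SigLIP-style models optimize over a batch of matched image-text embeddings. This is a minimal illustration, not the authors' training code: the function name, tensor shapes, toy data, and the temperature and bias values are assumptions (the initializations follow those reported in the SigLIP paper).

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                log_t: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    """Pairwise sigmoid loss over a batch of N matched image-text pairs."""
    # Similarities of all N x N pairs, scaled by a learnable temperature
    # exp(log_t) and shifted by a learnable bias.
    logits = img_emb @ txt_emb.t() * log_t.exp() + bias
    n = logits.size(0)
    # Target signs: +1 on the diagonal (matched pairs), -1 everywhere else.
    signs = 2 * torch.eye(n, device=logits.device) - 1
    # Each pair is an independent binary classification term.
    return -F.logsigmoid(signs * logits).sum() / n

# Toy usage with random, L2-normalized embeddings (batch of 8, dim 512).
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
# log(10) ~= 2.3 and bias = -10 are the initializations from the SigLIP paper.
loss = siglip_loss(img, txt, log_t=torch.tensor(2.3), bias=torch.tensor(-10.0))
print(loss.item())
```

Because every image-text pair contributes an independent binary term, this objective avoids the global softmax normalization of CLIP's InfoNCE loss, which is what makes SigLIP-style training well suited to the large batches used when pre-training on datasets of this scale.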