

DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset

January 15, 2026
作者: Hengyu Shen, Tiancheng Gu, Bin Qin, Lan Wu, Yuling Wu, Shuo Tan, Zelong Sun, Jun Wang, Nan Wu, Xiang An, Weidong Cai, Ziyong Feng, Kaicheng Yang
cs.AI

Abstract

Vision-Language Pre-training (VLP) models demonstrate strong performance across various downstream tasks by learning from large-scale image-text pairs through contrastive pretraining. The release of extensive English image-text datasets (e.g., COYO-700M and LAION-400M) has enabled the widespread adoption of models such as CLIP and SigLIP in tasks including cross-modal retrieval and image captioning. However, Chinese vision-language pretraining has lagged substantially behind due to the scarcity of high-quality Chinese image-text data. To address this gap, we develop a comprehensive pipeline for constructing a high-quality Chinese cross-modal dataset. Using this pipeline, we build DanQing, a dataset of 100 million image-text pairs collected from Common Crawl. Unlike existing datasets, DanQing is curated through a more rigorous selection process, yielding superior data quality. Moreover, DanQing is built primarily from 2024-2025 web data, enabling models to better capture evolving semantic trends and thus offering greater practical utility. We compare DanQing with existing datasets by continually pre-training the SigLIP2 model. Experimental results show that DanQing consistently achieves superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations. To facilitate further research in Chinese vision-language pre-training, we will open-source the DanQing dataset under the Creative Commons CC BY 4.0 license.
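As background for the contrastive pretraining the abstract refers to, the sketch below shows a SigLIP-style pairwise sigmoid loss over a batch of image and text embeddings in NumPy. This is an illustrative sketch only: the temperature `t`, bias `b`, and embedding shapes are assumptions for demonstration, not the actual DanQing/SigLIP2 training configuration.

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Sigmoid contrastive loss over all image-text pairs in a batch.

    Matched pairs (the diagonal) are pushed toward high similarity,
    all other pairings toward low similarity. `t` (temperature) and
    `b` (bias) are illustrative defaults, not tuned values.
    """
    # L2-normalize both embedding sets so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = t * img @ txt.T + b          # (n, n) scaled pairwise similarities
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0        # +1 on the diagonal, -1 elsewhere

    # -log sigmoid(labels * logits), computed stably as softplus(-z)
    return np.logaddexp(0.0, -labels * logits).mean()
```

Because the loss is a sum of independent per-pair binary terms rather than a softmax over the batch, it does not require large batches to form a meaningful normalizer, which is one motivation behind the sigmoid formulation used by SigLIP.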