From Pixels to Prose: A Large Dataset of Dense Image Captions
June 14, 2024
Authors: Vasu Singla, Kaiyu Yue, Sukriti Paul, Reza Shirkavand, Mayuka Jayawardhana, Alireza Ganjdanesh, Heng Huang, Abhinav Bhatele, Gowthami Somepalli, Tom Goldstein
cs.AI
Abstract
Training large vision-language models requires extensive, high-quality
image-text pairs. Existing web-scraped datasets, however, are noisy and lack
detailed image descriptions. To bridge this gap, we introduce PixelProse, a
comprehensive dataset of over 16 million synthetically generated captions,
leveraging cutting-edge vision-language models for detailed and accurate
descriptions. To ensure data integrity, we rigorously analyze our dataset for
problematic content, including child sexual abuse material (CSAM), personally
identifiable information (PII), and toxicity. We also provide valuable metadata
such as watermark presence and aesthetic scores, aiding in further dataset
filtering. We hope PixelProse will be a valuable resource for future
vision-language research. PixelProse is available at
https://huggingface.co/datasets/tomg-group-umd/pixelprose
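The abstract notes that metadata such as watermark presence and aesthetic scores can be used to further filter the dataset. A minimal sketch of what threshold-based filtering over such metadata might look like, assuming hypothetical field names `watermark_prob` and `aesthetic_score` (the actual PixelProse schema on Hugging Face may use different names):

```python
# Hypothetical sketch: filtering caption records by metadata thresholds.
# Field names ("watermark_prob", "aesthetic_score") are assumed for
# illustration only; check the actual PixelProse schema before use.

def filter_records(records, max_watermark=0.5, min_aesthetic=5.0):
    """Keep records with low watermark probability and a high aesthetic score."""
    return [
        r for r in records
        if r["watermark_prob"] <= max_watermark
        and r["aesthetic_score"] >= min_aesthetic
    ]

# Toy records standing in for dataset rows.
sample = [
    {"caption": "a red barn in a field", "watermark_prob": 0.1, "aesthetic_score": 6.2},
    {"caption": "stock photo with logo", "watermark_prob": 0.9, "aesthetic_score": 5.5},
    {"caption": "blurry street scene",   "watermark_prob": 0.2, "aesthetic_score": 3.1},
]

kept = filter_records(sample)
print([r["caption"] for r in kept])  # only the first record passes both checks
```

In practice the same predicate could be passed to `datasets.Dataset.filter` after loading the dataset from the link above.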