From Pixels to Prose: A Large Dataset of Dense Image Captions
June 14, 2024
Authors: Vasu Singla, Kaiyu Yue, Sukriti Paul, Reza Shirkavand, Mayuka Jayawardhana, Alireza Ganjdanesh, Heng Huang, Abhinav Bhatele, Gowthami Somepalli, Tom Goldstein
cs.AI
Abstract
Training large vision-language models requires extensive, high-quality
image-text pairs. Existing web-scraped datasets, however, are noisy and lack
detailed image descriptions. To bridge this gap, we introduce PixelProse, a
comprehensive dataset of over 16 million synthetically generated captions,
leveraging cutting-edge vision-language models for detailed and accurate
descriptions. To ensure data integrity, we rigorously analyze our dataset for
problematic content, including child sexual abuse material (CSAM), personally
identifiable information (PII), and toxicity. We also provide valuable metadata
such as watermark presence and aesthetic scores, aiding in further dataset
filtering. We hope PixelProse will be a valuable resource for future
vision-language research. PixelProse is available at
https://huggingface.co/datasets/tomg-group-umd/pixelprose
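The abstract notes that per-image metadata such as watermark presence and aesthetic scores can be used to filter the dataset further. A minimal sketch of such a filtering pass is shown below; the field names `has_watermark` and `aesthetic_score` are illustrative assumptions for this sketch, not the dataset's documented schema, so they should be checked against the actual column names on the Hugging Face page before use.

```python
def filter_records(records, min_aesthetic=5.0, allow_watermark=False):
    """Keep caption records that pass simple quality heuristics.

    Assumed (hypothetical) per-record fields:
      - "has_watermark":    bool flag for detected watermarks
      - "aesthetic_score":  float quality score (higher is better)
    """
    kept = []
    for r in records:
        if r["has_watermark"] and not allow_watermark:
            continue  # drop watermarked images by default
        if r["aesthetic_score"] < min_aesthetic:
            continue  # drop low-aesthetic images
        kept.append(r)
    return kept


# Toy example records standing in for dataset rows:
sample = [
    {"caption": "a red barn at dusk", "has_watermark": False, "aesthetic_score": 6.2},
    {"caption": "stock photo grid", "has_watermark": True, "aesthetic_score": 7.1},
    {"caption": "blurry street scene", "has_watermark": False, "aesthetic_score": 3.4},
]

print([r["caption"] for r in filter_records(sample)])
# → ['a red barn at dusk']
```

The same predicate could be passed to `datasets.Dataset.filter` after loading the dataset from the URL above, which avoids materializing rows that fail the checks.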