A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions

Abstract

Curation methods for massive vision-language datasets trade off dataset size against quality. However, even the highest-quality curated captions available are far too short to capture the rich visual detail in an image. To show the value of dense and highly aligned image-text pairs, we collect the Densely Captioned Images (DCI) dataset, containing 8012 natural images human-annotated with mask-aligned descriptions averaging more than 1000 words each. With precise and reliable captions associated with specific parts of an image, we can evaluate vision-language models' (VLMs') understanding of image content with a novel task that matches each caption with its corresponding subcrop. As current models are often limited to 77 text tokens, we also introduce a summarized version (sDCI) in which each caption's length is limited. We show that modern techniques that make progress on standard benchmarks do not translate into significant improvements on our sDCI-based benchmark. Lastly, we fine-tune CLIP using sDCI and show significant improvements over the baseline despite a small training set. By releasing the first human-annotated dense image captioning dataset, we hope to enable the development of new benchmarks and fine-tuning recipes for the next generation of VLMs.
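To make the caption-to-subcrop matching task concrete, the following is a minimal sketch of how such an evaluation could be scored. It is an illustration under stated assumptions, not the paper's evaluation code: it assumes that subcrop and caption embeddings have already been produced by some CLIP-style model, that row i of each array forms a ground-truth pair, and that a caption counts as matched when its most similar subcrop (by cosine similarity) is the correct one. The function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def subcrop_caption_matching_accuracy(crop_emb, cap_emb):
    """Fraction of captions whose most similar subcrop is the ground-truth one.

    Assumption for this sketch: crop_emb[i] and cap_emb[i] are a matching pair,
    and both arrays have shape (n, d).
    """
    crop_emb = l2_normalize(np.asarray(crop_emb, dtype=np.float64))
    cap_emb = l2_normalize(np.asarray(cap_emb, dtype=np.float64))
    # sim[i, j] = cosine similarity between caption i and subcrop j.
    sim = cap_emb @ crop_emb.T
    predicted = sim.argmax(axis=1)
    targets = np.arange(cap_emb.shape[0])
    return float((predicted == targets).mean())


# Toy embeddings standing in for real model outputs (hypothetical data).
crops = np.eye(3)
captions_aligned = np.eye(3)               # every caption points at its own crop
captions_swapped = np.eye(3)[[0, 2, 1]]    # captions 1 and 2 point at the wrong crop

acc_aligned = subcrop_caption_matching_accuracy(crops, captions_aligned)
acc_swapped = subcrop_caption_matching_accuracy(crops, captions_swapped)
```

In practice the embeddings would come from the image and text encoders of the model under test, with each caption truncated or summarized to fit the encoder's text-token limit.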
