CapsFusion: Rethinking Image-Text Data at Scale

Abstract

Large multimodal models demonstrate remarkable generalist ability to perform diverse multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute fundamentally to this success, but suffer from excessive noise. Recent studies use alternative captions synthesized by captioning models and have achieved notable benchmark performance. However, our experiments reveal significant Scalability Deficiency and World Knowledge Loss issues in models trained with synthetic captions, which have been largely obscured by their initial benchmark success. Upon closer examination, we identify the root cause as the overly simplified language structure and lack of knowledge details in existing synthetic captions. To provide higher-quality and more scalable multimodal pretraining data, we propose CapsFusion, an advanced framework that leverages large language models to consolidate and refine information from both web-based image-text pairs and synthetic captions. Extensive experiments show that CapsFusion captions exhibit remarkable all-round superiority over existing captions in terms of model performance (e.g., improvements of 18.8 and 18.3 in CIDEr score on COCO and NoCaps, respectively), sample efficiency (requiring 11-16 times less computation than baselines), world knowledge depth, and scalability. These advantages in effectiveness, efficiency, and scalability position CapsFusion as a promising candidate for future scaling of LMM training.

