Concept-Aware Batch Sampling Improves Language-Image Pretraining
November 25, 2025
Authors: Adhiraj Ghosh, Vishaal Udandarao, Thao Nguyen, Matteo Farina, Mehdi Cherti, Jenia Jitsev, Sewoong Oh, Elisa Ricci, Ludwig Schmidt, Matthias Bethge
cs.AI
Abstract
What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with a broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that our CABS method significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.
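
To make the batch-sampling idea concrete, below is a minimal Python sketch of how concept-aware batch construction could look, assuming each pooled sample carries a set of concept annotations (as in DataConcept). The function names, greedy coverage heuristic, scoring rule, and toy pool are illustrative assumptions, not the paper's actual CABS implementation.

```python
# Minimal sketch of concept-aware batch sampling (illustrative only).
# Each pooled sample is assumed to be a (sample_id, concept_set) pair.

def sample_batch_diversity(pool, batch_size):
    """CABS-DM-style heuristic: greedily pick the sample that adds the most
    previously unseen concepts, maximizing concept coverage in the batch."""
    covered = set()
    remaining = list(pool)
    batch = []
    while remaining and len(batch) < batch_size:
        best = max(remaining, key=lambda item: len(item[1] - covered))
        remaining.remove(best)
        batch.append(best[0])
        covered |= best[1]
    return batch


def sample_batch_frequency(pool, batch_size, target_concepts):
    """CABS-FM-style heuristic: rank samples by how many target concepts
    they contain, so batches are dense in the concepts of interest."""
    scored = sorted(
        pool,
        key=lambda item: len(item[1] & target_concepts),
        reverse=True,
    )
    return [sample_id for sample_id, _ in scored[:batch_size]]


# Toy pool of image-text pairs annotated with their concept composition.
pool = [
    ("img_0", {"dog", "park"}),
    ("img_1", {"cat"}),
    ("img_2", {"dog", "frisbee", "grass"}),
    ("img_3", {"car", "street"}),
]
print(sample_batch_diversity(pool, batch_size=2))
print(sample_batch_frequency(pool, batch_size=2, target_concepts={"dog"}))
```

In this sketch the same interface could be pointed at any user-defined target concept distribution, which is the kind of flexibility the abstract attributes to online, concept-based curation.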