Does Data Scaling Lead to Visual Compositional Generalization?
July 9, 2025
Authors: Arnas Uselis, Andrea Dittadi, Seong Joon Oh
cs.AI
Abstract
Compositional understanding is crucial for human intelligence, yet it remains
unclear whether contemporary vision models exhibit it. The dominant machine
learning paradigm is built on the premise that scaling data and model sizes
will improve out-of-distribution performance, including compositional
generalization. We test this premise through controlled experiments that
systematically vary data scale, concept diversity, and combination coverage. We
find that compositional generalization is driven by data diversity, not mere
data scale. Increased combinatorial coverage forces models to discover a
linearly factored representational structure, where concepts decompose into
additive components. We prove this structure is key to efficiency, enabling
perfect generalization from few observed combinations. Evaluating pretrained
models (DINO, CLIP), we find above-random yet imperfect performance, suggesting
partial presence of this structure. Our work motivates a stronger emphasis on
constructing diverse datasets for compositional generalization and highlights
the importance of representational structures that enable efficient
compositional learning. Code is available at
https://github.com/oshapio/visual-compositional-generalization.
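
The linearly factored structure described in the abstract can be made concrete with a minimal sketch (not taken from the paper; the toy setup, dimensions, and variable names are hypothetical): if the embedding of a (shape, color) pair decomposes additively as z(s, c) ≈ u_s + v_c, then a handful of observed combinations is enough to reconstruct all unseen ones by ordinary least squares.

```python
# Minimal sketch (hypothetical, not the paper's code): a linearly factored,
# additive representation z(shape, color) = u[shape] + v[color] can be
# recovered from only a few observed combinations.
import numpy as np

rng = np.random.default_rng(0)
n_shapes, n_colors, dim = 4, 4, 8

# Ground-truth additive components for each concept value.
u = rng.normal(size=(n_shapes, dim))   # shape components
v = rng.normal(size=(n_colors, dim))   # color components

def embed(s, c):
    """Embedding of the (shape=s, color=c) combination."""
    return u[s] + v[c]

# Observe only a "cross" of combinations: every shape with color 0 and
# every color with shape 0 -- 7 of the 16 possible pairs.
observed = [(s, 0) for s in range(n_shapes)] + [(0, c) for c in range(1, n_colors)]

# Fit additive components by least squares on the observed pairs only.
# Design matrix: one-hot(shape) concatenated with one-hot(color).
X = np.zeros((len(observed), n_shapes + n_colors))
Y = np.zeros((len(observed), dim))
for i, (s, c) in enumerate(observed):
    X[i, s] = 1.0
    X[i, n_shapes + c] = 1.0
    Y[i] = embed(s, c)

W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # shape: (n_shapes + n_colors, dim)
u_hat, v_hat = W[:n_shapes], W[n_shapes:]

# Predict an unseen combination, e.g. (shape=3, color=2).
pred = u_hat[3] + v_hat[2]
true = embed(3, 2)
print("reconstruction error:", np.linalg.norm(pred - true))
```

Because the toy embeddings are exactly additive, the printed reconstruction error is near machine precision even though only 7 of the 16 combinations were observed; this illustrates the sample-efficiency property the abstract attributes to linearly factored representations.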