Does Data Scaling Lead to Visual Compositional Generalization?
July 9, 2025
Authors: Arnas Uselis, Andrea Dittadi, Seong Joon Oh
cs.AI
Abstract
Compositional understanding is crucial for human intelligence, yet it remains
unclear whether contemporary vision models exhibit it. The dominant machine
learning paradigm is built on the premise that scaling data and model sizes
will improve out-of-distribution performance, including compositional
generalization. We test this premise through controlled experiments that
systematically vary data scale, concept diversity, and combination coverage. We
find that compositional generalization is driven by data diversity, not mere
data scale. Increased combinatorial coverage forces models to discover a
linearly factored representational structure, where concepts decompose into
additive components. We prove this structure is key to efficiency, enabling
perfect generalization from few observed combinations. Evaluating pretrained
models (DINO, CLIP), we find above-random yet imperfect performance, suggesting
partial presence of this structure. Our work motivates a stronger emphasis on
constructing diverse datasets for compositional generalization and highlights
the importance of representational structures that enable efficient
compositional learning. Code is available at
https://github.com/oshapio/visual-compositional-generalization.
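
The linearly factored structure described in the abstract can be made concrete with a minimal sketch (not taken from the paper; the toy setup, dimensions, and variable names are hypothetical): if the embedding of a (shape, color) pair decomposes additively as z(s, c) ≈ u_s + v_c, then a handful of observed combinations is enough to reconstruct all unseen ones by ordinary least squares.

```python
# Minimal sketch (hypothetical, not the paper's code): a linearly factored,
# additive representation z(shape, color) = u[shape] + v[color] can be
# recovered from only a few observed combinations.
import numpy as np

rng = np.random.default_rng(0)
n_shapes, n_colors, dim = 4, 4, 8

# Ground-truth additive components for each concept value.
u = rng.normal(size=(n_shapes, dim))   # shape components
v = rng.normal(size=(n_colors, dim))   # color components

def embed(s, c):
    """Embedding of the (shape=s, color=c) combination."""
    return u[s] + v[c]

# Observe only a "cross" of combinations: every shape with color 0 and
# every color with shape 0 -- 7 of the 16 possible pairs.
observed = [(s, 0) for s in range(n_shapes)] + [(0, c) for c in range(1, n_colors)]

# Fit additive components by least squares on the observed pairs only.
# Design matrix: one-hot(shape) concatenated with one-hot(color).
X = np.zeros((len(observed), n_shapes + n_colors))
Y = np.zeros((len(observed), dim))
for i, (s, c) in enumerate(observed):
    X[i, s] = 1.0
    X[i, n_shapes + c] = 1.0
    Y[i] = embed(s, c)

W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # shape: (n_shapes + n_colors, dim)
u_hat, v_hat = W[:n_shapes], W[n_shapes:]

# Predict an unseen combination, e.g. (shape=3, color=2).
pred = u_hat[3] + v_hat[2]
true = embed(3, 2)
print("reconstruction error:", np.linalg.norm(pred - true))
```

Because the toy embeddings are exactly additive, the printed reconstruction error is near machine precision even though only 7 of the 16 combinations were observed; this illustrates the sample-efficiency property the abstract attributes to linearly factored representations.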