

Does Data Scaling Lead to Visual Compositional Generalization?

July 9, 2025
Authors: Arnas Uselis, Andrea Dittadi, Seong Joon Oh
cs.AI

Abstract

Compositional understanding is crucial for human intelligence, yet it remains unclear whether contemporary vision models exhibit it. The dominant machine learning paradigm is built on the premise that scaling data and model sizes will improve out-of-distribution performance, including compositional generalization. We test this premise through controlled experiments that systematically vary data scale, concept diversity, and combination coverage. We find that compositional generalization is driven by data diversity, not mere data scale. Increased combinatorial coverage forces models to discover a linearly factored representational structure, where concepts decompose into additive components. We prove this structure is key to efficiency, enabling perfect generalization from few observed combinations. Evaluating pretrained models (DINO, CLIP), we find above-random yet imperfect performance, suggesting partial presence of this structure. Our work motivates stronger emphasis on constructing diverse datasets for compositional generalization, and considering the importance of representational structure that enables efficient compositional learning. Code available at https://github.com/oshapio/visual-compositional-generalization.
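To make the abstract's claim about linearly factored representations concrete, here is a minimal numerical sketch (not taken from the paper or its repository; the shape/color factors, the 16-dimensional embedding, and the specific coverage pattern are illustrative assumptions). If embeddings decompose additively into per-concept components, least squares over a handful of observed combinations recovers those components up to an irrelevant offset, and every unseen combination is then predicted exactly:

```python
# Minimal numerical sketch (not the paper's code): assume a ground-truth
# additive structure z(shape, color) = v_shape + v_color and show that the
# per-concept components can be recovered by least squares from a sparse,
# connected set of observed combinations, predicting unseen combinations.
import numpy as np

rng = np.random.default_rng(0)
n_shapes, n_colors, dim = 5, 5, 16

# Hypothetical ground-truth concept components (illustrative assumption).
shape_vecs = rng.normal(size=(n_shapes, dim))
color_vecs = rng.normal(size=(n_colors, dim))


def embed(s, c):
    """Linearly factored embedding of a (shape, color) combination."""
    return shape_vecs[s] + color_vecs[c]


# Observe only 2n - 1 of the n^2 combinations, chosen so that every shape
# and color appears and the bipartite coverage graph is connected.
observed = ([(i, i) for i in range(n_shapes)]
            + [(i, i + 1) for i in range(n_shapes - 1)])

# Each observation constrains the sum of one shape component and one color
# component; solve the resulting linear system with least squares.
A = np.zeros((len(observed), n_shapes + n_colors))
Z = np.zeros((len(observed), dim))
for row, (s, c) in enumerate(observed):
    A[row, s] = 1.0
    A[row, n_shapes + c] = 1.0
    Z[row] = embed(s, c)
components, *_ = np.linalg.lstsq(A, Z, rcond=None)

# Predict a combination that was never observed and compare to ground truth.
s, c = 4, 0  # (shape 4, color 0) is not in `observed`
pred = components[s] + components[n_shapes + c]
print("prediction error for unseen pair:", np.linalg.norm(pred - embed(s, c)))
```

In this toy setup only 2n - 1 of the n^2 combinations are observed, yet the unseen pair is reconstructed to machine precision, which is the sense in which an additive representational structure enables generalization from few observed combinations. The abstract reports that pretrained models such as DINO and CLIP exhibit this structure only partially, so their compositional performance is above random but imperfect.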