When Do We Not Need Larger Vision Models?

Abstract

Scaling up the size of vision models has been the de facto standard to obtain more powerful visual representations. In this work, we discuss the point beyond which larger vision models are not necessary. First, we demonstrate the power of Scaling on Scales (S^2), whereby a pre-trained and frozen smaller vision model (e.g., ViT-B or ViT-L), run over multiple image scales, can outperform larger models (e.g., ViT-H or ViT-G) on classification, segmentation, depth estimation, Multimodal LLM (MLLM) benchmarks, and robotic manipulation. Notably, S^2 achieves state-of-the-art performance in detailed understanding of MLLM on the V* benchmark, surpassing models such as GPT-4V. We examine the conditions under which S^2 is a preferred scaling approach compared to scaling on model size. While larger models have the advantage of better generalization on hard examples, we show that features of larger vision models can be well approximated by those of multi-scale smaller models. This suggests most, if not all, of the representations learned by current large pre-trained models can also be obtained from multi-scale smaller models. Our results show that a multi-scale smaller model has comparable learning capacity to a larger model, and pre-training smaller models with S^2 can match or even exceed the advantage of larger models. We release a Python package that can apply S^2 on any vision model with one line of code: https://github.com/bfshi/scaling_on_scales.
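
The core idea, running one frozen model over several image scales and combining the results, can be illustrated with a minimal NumPy sketch. This is not the released package's API. The abstract does not say how the scales are combined, so the tiling, average pooling, and channel concatenation below are assumptions made for illustration, and `toy_backbone`, `resize_nearest`, and `multiscale_features` are hypothetical names.

```python
import numpy as np


def resize_nearest(image, size):
    """Nearest-neighbour resize of an (H, W, C) array to (size, size, C)."""
    h, w, _ = image.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows][:, cols]


def toy_backbone(tile, patch=4):
    """Hypothetical stand-in for a frozen vision model.

    Mean-pools every patch x patch block, so a (base, base, C) tile
    becomes a (base // patch, base // patch, C) feature map.
    """
    h, w, c = tile.shape
    return tile.reshape(h // patch, patch, w // patch, patch, c).mean(axis=(1, 3))


def multiscale_features(image, backbone, base=8, scales=(1, 2)):
    """Run the same backbone at several scales and concatenate channels.

    Assumed procedure (an illustration, not confirmed by the abstract):
    resize to base * s, split into s x s tiles of size base, run the
    backbone on each tile, stitch the tile features back together,
    average-pool to the scale-1 feature map size, concatenate channels.
    """
    feats = []
    for s in scales:
        scaled = resize_nearest(image, base * s)
        rows = []
        for i in range(s):
            row = [
                backbone(scaled[i * base:(i + 1) * base, j * base:(j + 1) * base])
                for j in range(s)
            ]
            rows.append(np.concatenate(row, axis=1))
        fmap = np.concatenate(rows, axis=0)  # stitched map, s times larger per side
        f = fmap.shape[0] // s
        c = fmap.shape[-1]
        # Average-pool s x s blocks so every scale has the same spatial size.
        feats.append(fmap.reshape(f, s, f, s, c).mean(axis=(1, 3)))
    return np.concatenate(feats, axis=-1)


image = np.random.rand(16, 16, 3)
features = multiscale_features(image, toy_backbone)
print(features.shape)  # (2, 2, 6)
```

In this sketch the backbone only ever sees inputs of one fixed size, so its weights need no change. The extra scales add channels to the output rather than parameters to the model.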
