When Do We Not Need Larger Vision Models?

Abstract

Scaling up the size of vision models has been the de facto standard to obtain more powerful visual representations. In this work, we discuss the point beyond which larger vision models are not necessary. First, we demonstrate the power of Scaling on Scales (S^2), whereby a pre-trained and frozen smaller vision model (e.g., ViT-B or ViT-L), run over multiple image scales, can outperform larger models (e.g., ViT-H or ViT-G) on classification, segmentation, depth estimation, Multimodal LLM (MLLM) benchmarks, and robotic manipulation. Notably, S^2 achieves state-of-the-art performance in detailed understanding of MLLM on the V* benchmark, surpassing models such as GPT-4V. We examine the conditions under which S^2 is a preferred scaling approach compared to scaling on model size. While larger models have the advantage of better generalization on hard examples, we show that features of larger vision models can be well approximated by those of multi-scale smaller models. This suggests most, if not all, of the representations learned by current large pre-trained models can also be obtained from multi-scale smaller models. Our results show that a multi-scale smaller model has comparable learning capacity to a larger model, and pre-training smaller models with S^2 can match or even exceed the advantage of larger models. We release a Python package that can apply S^2 on any vision model with one line of code: https://github.com/bfshi/scaling_on_scales.
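
The idea of running one frozen model over multiple image scales can be sketched as below. This is a minimal illustration under stated assumptions, not the API of the released package: `toy_backbone`, `resize_nearest`, `avg_pool`, and `multiscale_features` are hypothetical names, the backbone is a stand-in that only mean-pools patches, and the choice to tile larger scales into base-size crops, average-pool the merged feature map, and concatenate along channels is an assumption that the abstract does not spell out.

```python
import numpy as np


def toy_backbone(tile, patch=4):
    """Hypothetical stand-in for a frozen vision model.

    Maps an (H, W, C) tile to an (H/patch, W/patch, C) feature map by
    mean-pooling each patch. A real model would be a pre-trained ViT.
    """
    h, w, c = tile.shape
    return tile.reshape(h // patch, patch, w // patch, patch, c).mean(axis=(1, 3))


def resize_nearest(img, size):
    """Nearest-neighbour resize of an (H, W, C) image to (size, size, C)."""
    h, w, _ = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]


def avg_pool(feat, factor):
    """Average-pool an (H, W, C) feature map by an integer factor."""
    h, w, c = feat.shape
    return feat.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))


def multiscale_features(img, backbone, base=16, scales=(1, 2)):
    """Run the same backbone at several image scales and concatenate features.

    Assumed procedure: for scale s, resize the image to (base*s, base*s),
    split it into s x s tiles of the base size so the backbone always sees
    its native input size, stitch the tile features together, pool back to
    the base feature resolution, then concatenate all scales along channels.
    """
    outputs = []
    for s in scales:
        scaled = resize_nearest(img, base * s)
        rows = []
        for i in range(s):
            row = [
                backbone(scaled[i * base:(i + 1) * base, j * base:(j + 1) * base])
                for j in range(s)
            ]
            rows.append(np.concatenate(row, axis=1))
        merged = np.concatenate(rows, axis=0)
        outputs.append(avg_pool(merged, s))
    return np.concatenate(outputs, axis=-1)


if __name__ == "__main__":
    image = np.random.default_rng(0).random((32, 32, 3))
    feats = multiscale_features(image, toy_backbone)
    print(feats.shape)  # (4, 4, 6)
```

The spatial size of the output matches the single-scale feature map, while the channel dimension grows with the number of scales, so a downstream head only needs a wider input layer. The backbone itself is never modified or retrained in this sketch.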
