Efficient Test-Time Scaling for Small Vision-Language Models

Abstract

Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, which contradicts the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates their outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudo-labels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs, without additional tuning.
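
To make the idea of token-level aggregation concrete, the following is a minimal sketch, not the method as implemented in the paper. It assumes that aggregation means averaging the next-token probability distributions over all augmented views at each decoding step and then decoding greedily; the abstract does not specify the aggregation rule. The names `aggregate_token_probs`, `ttaug_greedy_decode`, and `toy_model` are hypothetical, and `toy_model` stands in for a real VLM forward pass.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: given one augmented input and the tokens decoded so
# far, return a probability distribution over the vocabulary.
NextTokenFn = Callable[[str, List[int]], Sequence[float]]


def aggregate_token_probs(per_view_probs: List[Sequence[float]]) -> List[float]:
    """Average next-token distributions across views (an assumed rule)."""
    n_views = len(per_view_probs)
    vocab_size = len(per_view_probs[0])
    return [sum(p[t] for p in per_view_probs) / n_views for t in range(vocab_size)]


def ttaug_greedy_decode(
    views: List[str],
    next_token_probs: NextTokenFn,
    eos_id: int,
    max_len: int = 16,
) -> List[int]:
    """Decode one shared sequence; no model parameters are updated."""
    prefix: List[int] = []
    for _ in range(max_len):
        # Every augmented view is conditioned on the same shared prefix.
        per_view = [next_token_probs(view, prefix) for view in views]
        avg = aggregate_token_probs(per_view)
        token = max(range(len(avg)), key=avg.__getitem__)
        prefix.append(token)
        if token == eos_id:
            break
    return prefix


def toy_model(view: str, prefix: List[int]) -> Sequence[float]:
    """Toy stand-in for a VLM: vocabulary {0, 1, 2}, where 2 is end-of-sequence."""
    if prefix:
        return [0.0, 0.0, 1.0]  # always stop after the first token
    first_step = {
        "a": [0.6, 0.4, 0.0],  # this view alone would pick token 0
        "b": [0.1, 0.9, 0.0],
        "c": [0.2, 0.8, 0.0],
    }
    return first_step[view]


consensus = ttaug_greedy_decode(["a", "b", "c"], toy_model, eos_id=2)
single_view = ttaug_greedy_decode(["a"], toy_model, eos_id=2)
```

In this toy setup the view "a" on its own prefers token 0, while the averaged distribution over all three views prefers token 1, so the aggregated decode can differ from a single-view decode. Under the same assumptions, a sequence such as `consensus` is the kind of output that could serve as a consensus-based pseudo-label for TTAdapt.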
