Efficient Test-Time Scaling for Small Vision-Language Models
October 3, 2025
Authors: Mehmet Onurcan Kaya, Desmond Elliott, Dim P. Papadopoulos
cs.AI
Abstract
Small Vision-Language Models (VLMs) provide a computationally efficient
alternative to larger models, at the cost of weaker generalization abilities
and downstream task performance. These shortcomings could be addressed by
test-time scaling techniques, but existing methods are typically
computationally demanding, contradicting the resource-efficient design goals of
small models. To address these limitations, we propose two novel and efficient
test-time scaling strategies that leverage model-internal features rather
than external supervision: (i) Test-Time Augmentation (TTAug), which generates
multiple augmented inputs and aggregates outputs at the token level without
parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model
parameters during inference using consensus-based pseudolabels from TTAug.
Through extensive experiments across nine benchmarks, we demonstrate consistent
performance improvements while maintaining computational efficiency suitable
for resource-constrained environments. We further demonstrate the generality of
our approach both within a model family at different scales and across
different VLMs, without additional tuning.
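
To make the two strategies concrete, the following is a minimal PyTorch-style sketch, not the paper's implementation. It assumes a hypothetical model interface in which `model(view, token_ids)` returns next-token logits and `model.token_logits(view, token_ids)` returns per-position logits; `views` is a list of augmented versions of the same image-prompt input, and `optimizer` is assumed to hold only a small subset of parameters (for example, normalization layers). TTAug is illustrated as averaging next-token logits across views at every decoding step with no weight updates; TTAdapt then treats the TTAug consensus output as a pseudolabel for a lightweight parameter update during inference.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def ttaug_decode(model, views, prompt_ids, max_new_tokens=32, eos_id=2):
    """Greedy decoding where each step's next-token logits are averaged over
    all augmented views (token-level aggregation, no parameter updates).
    `eos_id` is model-specific and assumed here for illustration."""
    generated = prompt_ids.clone()
    for _ in range(max_new_tokens):
        # Hypothetical interface: model(view, token_ids) -> next-token logits (vocab,)
        step_logits = torch.stack([model(v, generated) for v in views])  # (V, vocab)
        next_id = step_logits.mean(dim=0).argmax()
        generated = torch.cat([generated, next_id.view(1)])
        if next_id.item() == eos_id:
            break
    return generated


def ttadapt_step(model, views, prompt_ids, optimizer):
    """One adaptation step: use the TTAug consensus tokens as pseudolabels and
    update only the parameters held by `optimizer` (assumed to be a small subset)."""
    pseudo = ttaug_decode(model, views, prompt_ids)   # consensus pseudolabel (detached)
    target = pseudo[len(prompt_ids):]                 # newly generated token ids
    loss = 0.0
    for v in views:
        # Hypothetical interface: model.token_logits(view, token_ids) returns
        # logits for each generated position, shape (len(target), vocab).
        logits = model.token_logits(v, pseudo)
        loss = loss + F.cross_entropy(logits, target)
    (loss / len(views)).backward()
    optimizer.step()
    optimizer.zero_grad()
```

The split mirrors the abstract: `ttaug_decode` needs only extra forward passes, while `ttadapt_step` reuses its consensus output as supervision, so no external labels are required in either case.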