Efficient Test-Time Scaling for Small Vision-Language Models
October 3, 2025
Authors: Mehmet Onurcan Kaya, Desmond Elliott, Dim P. Papadopoulos
cs.AI
Abstract
Small Vision-Language Models (VLMs) provide a computationally efficient
alternative to larger models, at the cost of weaker generalization abilities
and downstream task performance. These shortcomings could be addressed by
test-time scaling techniques, but existing methods are typically
computationally demanding, contradicting the resource-efficient design goals of
small models. To address these limitations, we propose two novel and efficient
test-time scaling strategies that leverage model-internal features rather
than external supervision: (i) Test-Time Augmentation (TTAug), which generates
multiple augmented inputs and aggregates outputs at the token level without
parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model
parameters during inference using consensus-based pseudolabels from TTAug.
Through extensive experiments across nine benchmarks, we demonstrate consistent
performance improvements while maintaining computational efficiency suitable
for resource-constrained environments. The generality of our approach is
demonstrated both within models at different scales and across different VLMs
without additional tuning.
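The TTAug idea described above — run the model on several augmented views of the same input and aggregate the outputs at the token level, with no parameter updates — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the model interface, the augmentation functions, and the simple per-position majority vote are all assumptions made for the sketch.

```python
from collections import Counter
from typing import Callable, List, Sequence

def ttaug_decode(
    model: Callable[[str], List[int]],            # hypothetical: input -> token ids
    augment_fns: Sequence[Callable[[str], str]],  # hypothetical augmentations
    x: str,
) -> List[int]:
    """Token-level aggregation across augmented views (sketch).

    Each augmented input is decoded independently; the final output
    takes a majority vote per token position. No parameters change.
    """
    outputs = [model(fn(x)) for fn in augment_fns]
    # Align on the shortest decoded sequence (a simplifying assumption;
    # real decoders would need a length-aware aggregation scheme).
    length = min(len(out) for out in outputs)
    return [
        Counter(out[i] for out in outputs).most_common(1)[0][0]
        for i in range(length)
    ]

# Toy usage: a fake "model" that maps characters to token ids, with
# identity / lowercase / uppercase as stand-in augmentations.
toy_model = lambda s: [ord(c) % 5 for c in s]
voted = ttaug_decode(toy_model, [str, str.lower, str.upper], "Ab")
```

The same consensus output could then serve as the pseudolabel that TTAdapt uses to update parameters during inference, per the abstract; that step requires a differentiable model and is omitted here.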