Transferable and Principled Efficiency for Open-Vocabulary Segmentation
April 11, 2024
Authors: Jingxuan Xu, Wuyang Chen, Yao Zhao, Yunchao Wei
cs.AI
Abstract
The recent success of pre-trained vision-language foundation models makes
Open-Vocabulary Segmentation (OVS) possible. Despite its promising performance,
this approach introduces heavy computational overheads due to two challenges: 1)
the large model size of the backbone; 2) the expensive cost of fine-tuning.
These challenges hinder this OVS strategy from being widely applicable and
affordable in real-world scenarios. Although traditional methods such as model
compression and efficient fine-tuning can address these challenges, they often
rely on heuristics, so their solutions cannot be easily transferred across
models and necessitate costly re-training for each new one. In the context of
efficient OVS, we aim to achieve performance that is
comparable to or even better than prior OVS works based on large
vision-language foundation models, by utilizing smaller models that incur lower
training costs. The core strategy is to make our efficiency principled and thus
seamlessly transferable from one OVS framework to others without further
customization. Comprehensive experiments on diverse OVS benchmarks demonstrate
our superior trade-off between segmentation accuracy and computation costs over
previous works. Our code is available at https://github.com/Xujxyang/OpenTrans.
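To make the "transferable efficiency" goal concrete, here is a minimal sketch of one generic compression primitive that can transfer between backbones: a magnitude-based pruning mask computed once and then applied to any weight tensor of the same shape. This is an illustrative assumption, not the paper's actual method (see the linked repository for that); `magnitude_prune_mask` is a hypothetical helper written for this example.

```python
import numpy as np

def magnitude_prune_mask(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a boolean mask keeping the largest-magnitude (1 - sparsity)
    fraction of entries; everything else is pruned to zero."""
    k = int(weights.size * (1.0 - sparsity))  # number of weights to keep
    if k == 0:
        return np.zeros(weights.shape, dtype=bool)
    # k-th largest absolute value acts as the keep threshold
    threshold = np.partition(np.abs(weights).ravel(), -k)[-k]
    return np.abs(weights) >= threshold

# Compute a mask on one layer's weights...
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
mask = magnitude_prune_mask(w, sparsity=0.75)   # keep 25% of 16 weights
pruned = w * mask                               # zero out the rest
```

Because the mask is just a boolean tensor, it can be reused on another model's layer of the same shape without re-running the scoring step, which is the kind of plug-and-play transfer the abstract argues heuristic methods lack.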