Transferable and Principled Efficiency for Open-Vocabulary Segmentation

Abstract

The recent success of pre-trained vision-language foundation models has made Open-Vocabulary Segmentation (OVS) possible. Despite its promising performance, this approach incurs heavy computational overhead from two sources: 1) the large model size of the backbone; 2) the high cost of fine-tuning. These challenges keep this OVS strategy from being widely applicable and affordable in real-world scenarios. Although traditional methods such as model compression and efficient fine-tuning can address these challenges, they often rely on heuristics. As a result, their solutions cannot be easily transferred and require re-training on each different model, which comes at a cost. In the context of efficient OVS, we aim to achieve performance comparable to or even better than prior OVS works based on large vision-language foundation models, while using smaller models that incur lower training costs. Our core strategy is to make our efficiency principled and thus seamlessly transferable from one OVS framework to others without further customization. Comprehensive experiments on diverse OVS benchmarks demonstrate a superior trade-off between segmentation accuracy and computation cost compared with previous works. Our code is available at https://github.com/Xujxyang/OpenTrans.
