UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders

Abstract

Task-agnostic feature upsampling has emerged as a promising research area for efficiently creating denser features from pre-trained visual backbones. These methods act as a shortcut, producing dense features at a fraction of the cost by learning to map low-resolution features to high-resolution versions. While early works in this space used iterative upsampling approaches, more recent works have switched to cross-attention-based methods, which risk running into the same efficiency scaling problems as the backbones they upsample. In this work, we demonstrate that iterative upsampling methods can still compete with cross-attention-based methods; moreover, they can achieve state-of-the-art performance at lower inference cost. We propose UPLiFT, an architecture for Universal Pixel-dense Lightweight Feature Transforms. We also propose an efficient Local Attender operator to overcome the limitations of prior iterative feature upsampling methods. This operator uses an alternative attentional pooling formulation that is defined fully locally. We show that the Local Attender allows UPLiFT to maintain stable features throughout upsampling, enabling state-of-the-art performance at lower inference cost than existing pixel-dense feature upsamplers. In addition, we apply UPLiFT to generative downstream tasks and show that it is competitive with state-of-the-art Coupled Flow Matching models for VAE feature upsampling. Altogether, UPLiFT offers a versatile and efficient approach to creating denser features.
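
To make the idea of attentional pooling "defined fully locally" more concrete, the following is a minimal NumPy sketch of generic local-window attention upsampling. It is an illustration under stated assumptions, not the Local Attender operator itself: the function name `local_attention_upsample`, the use of an average-pooled guide as keys, and the `scale` and `radius` parameters are all hypothetical choices made for this example. Each high-resolution position attends only to a small neighbourhood of low-resolution features, so the cost grows with the window size rather than with the total number of tokens.

```python
import numpy as np

def local_attention_upsample(feats, guide, scale=2, radius=1):
    """Upsample feats (h, w, c) to (h*scale, w*scale, c) with local attention.

    Hypothetical sketch: queries come from a high-resolution guide (H, W, d),
    keys from the same guide average-pooled to the feature resolution.
    """
    h, w, c = feats.shape
    H, W, d = guide.shape
    assert (H, W) == (h * scale, w * scale)

    # Keys: average-pool the guide down to the low-resolution grid.
    keys = guide.reshape(h, scale, w, scale, d).mean(axis=(1, 3))

    # Zero-pad so every window lookup is in bounds; track which cells are real.
    pad = ((radius, radius), (radius, radius), (0, 0))
    keys_p = np.pad(keys, pad)
    feats_p = np.pad(feats, pad)
    valid_p = np.pad(np.ones((h, w), dtype=bool), radius)

    # Low-resolution cell that each high-resolution pixel falls into.
    ys, xs = np.meshgrid(np.arange(H) // scale, np.arange(W) // scale,
                         indexing="ij")

    scores, values = [], []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ny, nx = ys + dy + radius, xs + dx + radius
            s = (guide * keys_p[ny, nx]).sum(axis=-1) / np.sqrt(d)
            # Padded (out-of-image) neighbours get zero attention weight.
            scores.append(np.where(valid_p[ny, nx], s, -np.inf))
            values.append(feats_p[ny, nx])

    scores = np.stack(scores, axis=-1)   # (H, W, K)
    values = np.stack(values, axis=-2)   # (H, W, K, c)

    # Numerically stable softmax over the K window positions.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    # Each output feature is a convex combination of nearby input features.
    return (weights[..., None] * values).sum(axis=-2)
```

Because the softmax weights are non-negative and sum to one, every output vector in this sketch is a convex combination of existing low-resolution features, which is one plausible reading of how a local pooling operator could keep features stable during upsampling.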
