ATTN-FIQA: Interpretable Attention-based Face Image Quality Assessment with Vision Transformers

Abstract

Face Image Quality Assessment (FIQA) aims to assess the recognition utility of face samples and is essential for reliable face recognition (FR) systems. Existing approaches require computationally expensive procedures such as multiple forward passes, backpropagation, or additional training, and only recently has work focused on the use of Vision Transformers. Recent studies have highlighted that these architectures inherently function as saliency learners, with attention patterns naturally encoding spatial importance. This work proposes ATTN-FIQA, a novel training-free approach that investigates whether pre-softmax attention scores from pre-trained Vision Transformer-based face recognition models can serve as quality indicators. We hypothesize that attention magnitudes intrinsically encode quality: high-quality images with discriminative facial features enable strong query-key alignments that produce focused, high-magnitude attention patterns, while degraded images generate diffuse, low-magnitude patterns. ATTN-FIQA extracts pre-softmax attention matrices from the final transformer block, aggregates multi-head attention information across all patches, and computes image-level quality scores through simple averaging. This requires only a single forward pass through pre-trained models, without architectural modifications, backpropagation, or additional training. Through comprehensive evaluation across eight benchmark datasets and four FR models, this work demonstrates that attention-based quality scores effectively correlate with face image quality and provide spatial interpretability, revealing which facial regions contribute most to quality determination.
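The scoring pipeline described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the random stand-in weights, and the exact aggregation order (mean over heads and query positions to obtain a per-patch map, then mean over patches) are assumptions made for illustration, since the abstract specifies only that multi-head attention is aggregated across patches and averaged.

```python
import numpy as np

def pre_softmax_attention(tokens, w_q, w_k, num_heads):
    """Scaled query-key scores, taken before the softmax, for one image.

    tokens: (N, D) patch embeddings entering the final block (hypothetical input).
    w_q, w_k: (D, D) query and key projection matrices (stand-ins for trained weights).
    Returns an array of shape (num_heads, N, N).
    """
    n, d = tokens.shape
    head_dim = d // num_heads
    # Project, then split the feature dimension into heads: (num_heads, N, head_dim).
    q = (tokens @ w_q).reshape(n, num_heads, head_dim).transpose(1, 0, 2)
    k = (tokens @ w_k).reshape(n, num_heads, head_dim).transpose(1, 0, 2)
    # Raw alignment scores Q K^T / sqrt(head_dim); no softmax is applied.
    return q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)

def attention_quality(scores):
    """Assumed aggregation: average over heads and queries, then over patches."""
    patch_map = scores.mean(axis=(0, 1))  # (N,) mean score received by each patch
    return float(patch_map.mean()), patch_map

# Usage with random stand-in data (16 patches, 32 features, 4 heads).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 32))
w_q = rng.standard_normal((32, 32))
w_k = rng.standard_normal((32, 32))
quality, patch_map = attention_quality(pre_softmax_attention(tokens, w_q, w_k, 4))
```

In this sketch, `patch_map` plays the role of the spatial interpretability map (one value per patch), while `quality` is the single image-level score. In practice the tokens and projection weights would come from the final block of a pre-trained ViT-based FR model during its single forward pass.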
