
ATTN-FIQA: Interpretable Attention-based Face Image Quality Assessment with Vision Transformers

April 21, 2026
Authors: Guray Ozgur, Tahar Chettaoui, Eduarda Caldeira, Jan Niklas Kolf, Marco Huber, Andrea Atzori, Naser Damer, Fadi Boutros
cs.AI

Abstract

Face Image Quality Assessment (FIQA) aims to assess the recognition utility of face samples and is essential for reliable face recognition (FR) systems. Existing approaches rely on computationally expensive procedures such as multiple forward passes, backpropagation, or additional training, and only recent work has focused on the use of Vision Transformers. Recent studies have highlighted that these architectures inherently function as saliency learners, with attention patterns naturally encoding spatial importance. This work proposes ATTN-FIQA, a novel training-free approach that investigates whether pre-softmax attention scores from pre-trained Vision Transformer-based face recognition models can serve as quality indicators. We hypothesize that attention magnitudes intrinsically encode quality: high-quality images with discriminative facial features enable strong query-key alignments, producing focused, high-magnitude attention patterns, while degraded images generate diffuse, low-magnitude patterns. ATTN-FIQA extracts pre-softmax attention matrices from the final transformer block, aggregates multi-head attention information across all patches, and computes image-level quality scores through simple averaging, requiring only a single forward pass through a pre-trained model without architectural modifications, backpropagation, or additional training. Through comprehensive evaluation across eight benchmark datasets and four FR models, this work demonstrates that attention-based quality scores correlate effectively with face image quality and provide spatial interpretability, revealing which facial regions contribute most to quality determination.
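The scoring step described in the abstract, which averages pre-softmax attention magnitudes over heads and patches, can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the function name `attn_fiqa_score`, the tensor layout, and the exact aggregation (a plain mean of the scaled query-key dot products) are assumptions; the paper's actual aggregation may differ.

```python
import numpy as np

def attn_fiqa_score(q, k):
    """Hypothetical sketch of the ATTN-FIQA scoring idea.

    q, k: query/key tensors of shape (heads, patches, head_dim),
    assumed to come from the final transformer block of a
    pre-trained ViT-based face recognition model during a single
    forward pass. Returns one image-level quality score.
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    # Pre-softmax attention scores: scaled query-key dot products,
    # one (patches x patches) matrix per head.
    scores = np.einsum("hqd,hkd->hqk", q, k) * scale
    # Simple averaging over heads and all patch pairs; strong
    # query-key alignments yield a higher (better) score.
    return float(scores.mean())
```

Because the score is read off tensors already computed inside the attention layer, it adds no backpropagation or extra training; in a real model one would capture `q` and `k` with a forward hook on the last block.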