ViTNT-FIQA: Training-Free Face Image Quality Assessment with Vision Transformers
January 9, 2026
Authors: Guray Ozgur, Eduarda Caldeira, Tahar Chettaoui, Jan Niklas Kolf, Marco Huber, Naser Damer, Fadi Boutros
cs.AI
Abstract
Face Image Quality Assessment (FIQA) is essential for reliable face recognition systems. Current approaches primarily exploit only final-layer representations, while training-free methods require multiple forward passes or backpropagation. We propose ViTNT-FIQA, a training-free approach that measures the stability of patch embedding evolution across intermediate Vision Transformer (ViT) blocks. We demonstrate that high-quality face images exhibit stable feature refinement trajectories across blocks, while degraded images show erratic transformations. Our method computes Euclidean distances between L2-normalized patch embeddings from consecutive transformer blocks and aggregates them into image-level quality scores. We empirically validate this correlation on a quality-labeled synthetic dataset with controlled degradation levels. Unlike existing training-free approaches, ViTNT-FIQA requires only a single forward pass without backpropagation or architectural modifications. Through extensive evaluation on eight benchmarks (LFW, AgeDB-30, CFP-FP, CALFW, Adience, CPLFW, XQLFW, IJB-C), we show that ViTNT-FIQA achieves competitive performance with state-of-the-art methods while maintaining computational efficiency and immediate applicability to any pre-trained ViT-based face recognition model.
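The core measurement described in the abstract can be sketched in a few lines. This is a minimal, hypothetical illustration, not the authors' reference implementation: the function name, the mean aggregation over patches and blocks, and the sign convention (negating instability so that higher scores mean higher quality) are assumptions made here for clarity.

```python
import numpy as np

def vitnt_fiqa_score(block_embeddings):
    """Training-free quality score from per-block ViT patch embeddings.

    block_embeddings: array of shape (num_blocks, num_patches, dim),
    the patch tokens collected after each transformer block during a
    single forward pass.
    """
    x = np.asarray(block_embeddings, dtype=np.float64)
    # L2-normalize each patch embedding so distances compare directions,
    # not magnitudes.
    x = x / np.linalg.norm(x, axis=-1, keepdims=True)
    # Euclidean distance between the same patch in consecutive blocks:
    # shape (num_blocks - 1, num_patches).
    diffs = np.linalg.norm(x[1:] - x[:-1], axis=-1)
    # Aggregate patch-level instabilities into one image-level value
    # (mean aggregation is an assumption of this sketch).
    instability = diffs.mean()
    # Stable refinement trajectories (small inter-block distances) are
    # associated with high quality, so negate instability as the score.
    return -instability
```

Under this sketch, an image whose patch embeddings evolve smoothly across blocks receives a higher score than one whose embeddings jump erratically, matching the stability argument in the abstract.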