ATTN-FIQA: Interpretable Attention-based Face Image Quality Assessment with Vision Transformers

Abstract

Face Image Quality Assessment (FIQA) aims to assess the recognition utility of face samples and is essential for reliable face recognition (FR) systems. Existing approaches require computationally expensive procedures such as multiple forward passes, backpropagation, or additional training, and only recent work has focused on the use of Vision Transformers. Recent studies have highlighted that these architectures inherently function as saliency learners, with attention patterns naturally encoding spatial importance. This work proposes ATTN-FIQA, a novel training-free approach that investigates whether pre-softmax attention scores from pre-trained Vision Transformer-based face recognition models can serve as quality indicators. We hypothesize that attention magnitudes intrinsically encode quality: high-quality images with discriminative facial features enable strong query-key alignments that produce focused, high-magnitude attention patterns, while degraded images generate diffuse, low-magnitude patterns. ATTN-FIQA extracts pre-softmax attention matrices from the final transformer block, aggregates multi-head attention information across all patches, and computes image-level quality scores through simple averaging. It requires only a single forward pass through a pre-trained model, with no architectural modifications, backpropagation, or additional training. Through comprehensive evaluation across eight benchmark datasets and four FR models, this work demonstrates that attention-based quality scores effectively correlate with face image quality and provide spatial interpretability, revealing which facial regions contribute most to quality determination.
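
The scoring step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify the exact aggregation order, so the sketch assumes a mean over heads, then a mean over query tokens for each key patch, then a mean over patches. It also assumes the usual 1/sqrt(d) scaling of the query-key logits and that only patch tokens are present. The function names, tensor shapes, and random inputs are hypothetical stand-ins for the queries and keys of a real model's final transformer block.

```python
import numpy as np


def pre_softmax_attention(q, k):
    """Scaled dot-product logits QK^T / sqrt(d) per head, before any softmax.

    q, k: arrays of shape (heads, tokens, head_dim).
    Returns an array of shape (heads, tokens, tokens).
    """
    head_dim = q.shape[-1]
    return q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)


def attn_quality(q, k):
    """Return (image-level score, per-patch map) from final-block queries/keys.

    Assumed aggregation (not confirmed by the abstract): mean over heads,
    then mean over query tokens per key patch, then mean over patches.
    """
    logits = pre_softmax_attention(q, k)   # (heads, tokens, tokens)
    head_mean = logits.mean(axis=0)        # (tokens, tokens)
    patch_map = head_mean.mean(axis=0)     # (tokens,), one value per key patch
    return float(patch_map.mean()), patch_map


# Random stand-ins for the last block's queries and keys (hypothetical sizes).
rng = np.random.default_rng(0)
heads, tokens, head_dim = 8, 144, 64
q = rng.standard_normal((heads, tokens, head_dim))
k = rng.standard_normal((heads, tokens, head_dim))

score, patch_map = attn_quality(q, k)

# Reshape the per-patch values to the patch grid to view them as a heatmap.
side = int(np.sqrt(tokens))
heatmap = patch_map.reshape(side, side)
```

With a real model, `q` and `k` would come from a single forward pass (for example, captured with a hook on the final attention layer), and `heatmap` would give the spatial view of which regions drive the score.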
