EX-FIQA: Leveraging Intermediate Early-Exit Representations from Vision Transformers for Face Image Quality Assessment

Abstract

Face Image Quality Assessment (FIQA) is crucial for reliable face recognition (FR) systems, yet existing Vision Transformer (ViT)-based approaches rely exclusively on final-layer representations, ignoring quality-relevant information captured at intermediate network depths. This paper presents the first comprehensive investigation of how intermediate representations within ViTs contribute to face quality assessment through early exit mechanisms and score fusion strategies. We systematically analyze all twelve transformer blocks of ViT-FIQA architectures, demonstrating that different depths capture distinct and complementary quality-relevant information, as evidenced by varying attention patterns and performance characteristics across network layers. We propose a score fusion framework that combines quality predictions from multiple transformer blocks without architectural modifications or additional training. Our early exit analysis identifies favorable performance-efficiency trade-offs, enabling significant computational savings while maintaining competitive performance. Through extensive evaluation across eight benchmark datasets using four FR models, we demonstrate that our fusion strategy improves upon single-exit approaches. The proposed fusion approach employs depth-weighted averaging that assigns progressively higher importance to deeper transformer blocks, achieving the best quality assessment performance by leveraging the hierarchical nature of feature learning in ViTs. Our work challenges the conventional wisdom that only deep features matter for face analysis, revealing that intermediate representations contain valuable information for quality assessment. The proposed framework offers practical benefits for real-world biometric systems by enabling adaptive computation based on resource constraints while maintaining competitive quality assessment capabilities.
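
As a rough illustration of the depth-weighted averaging described above, the following minimal sketch fuses one quality score per transformer block into a single score. The abstract does not state the weighting scheme, so the linear weights used here (block i receives a weight proportional to i) are an assumption, and the function names `depth_weights` and `fuse_quality_scores` are hypothetical rather than taken from the authors' implementation.

```python
from typing import List, Sequence


def depth_weights(num_blocks: int) -> List[float]:
    """Return normalized weights that grow with block depth.

    Assumption: the weight of block i (1-indexed) is proportional to i.
    The source only says deeper blocks receive progressively higher weight.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    total = num_blocks * (num_blocks + 1) / 2  # 1 + 2 + ... + num_blocks
    return [i / total for i in range(1, num_blocks + 1)]


def fuse_quality_scores(block_scores: Sequence[float]) -> float:
    """Fuse per-block quality scores, ordered shallow to deep.

    No retraining or architectural change is involved: the function
    only averages scores that the blocks have already produced.
    """
    weights = depth_weights(len(block_scores))
    return sum(w * s for w, s in zip(weights, block_scores))


if __name__ == "__main__":
    # Made-up scores for twelve blocks, for illustration only.
    scores = [0.40, 0.42, 0.45, 0.50, 0.55, 0.58,
              0.60, 0.63, 0.65, 0.66, 0.68, 0.70]
    print(f"fused quality score: {fuse_quality_scores(scores):.4f}")
```

Because the weights sum to one, the fused score stays within the range of the per-block scores. Passing only the first k scores gives one possible way to combine fusion with an early exit at block k, though whether the paper combines them this way is not stated in the abstract.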