Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms

Abstract

Robustness of deep neural networks is essential in safety-critical deployments, yet existing evaluation methods are mostly attack-dependent and hard to interpret. We propose a principled, attack-agnostic robustness metric based on the spectral norm of the Fisher Information Matrix (FIM), which quantifies the worst-case sensitivity of a model's output distribution to input perturbations. Theoretically, we establish that the FIM equals the variance of the input Jacobian and derive closed-form spectral upper bounds for common architectures, including VGG, ResNet, DenseNet, and Transformer, providing the first theoretical robustness ranking. For scalable evaluation, we develop efficient algorithms, including power iteration and Hutchinson-based estimation, that support both white-box and black-box settings. Extensive experiments on multiple datasets, including CIFAR, ImageNet, and medical images, and on a range of architectures show a strong correlation between the proposed metric and adversarial vulnerability. The framework serves as an interpretable diagnostic tool that complements attack-based evaluation, offers insight into architectural sensitivity, and guides the design of more robust models. Code is available at https://github.com/franz-chang/SRP/.
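As an illustration of the two estimators named in the abstract, the sketch below applies power iteration and a Hutchinson trace estimate to the input-space FIM of a toy linear softmax classifier. It is not the authors' implementation (see the repository linked above for that). The toy model, the function names (`fim_matvec`, `spectral_norm_power_iteration`, `hutchinson_trace`), and all sizes are assumptions made for this example. It relies on the standard identity F(x) = J^T (diag(p) - p p^T) J for a softmax output p with logit Jacobian J, so F is only ever accessed through matrix-vector products.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def fim_matvec(W, p, v):
    """Return F v for F = W^T (diag(p) - p p^T) W without forming F.

    W is the logit Jacobian of the toy linear model (assumption: logits = W x + b).
    """
    u = W @ v                      # Jacobian-vector product, shape (K,)
    u = p * u - p * (p @ u)        # multiply by diag(p) - p p^T
    return W.T @ u                 # transposed-Jacobian product, shape (d,)

def spectral_norm_power_iteration(matvec, dim, n_iter=500, seed=0):
    """Estimate the largest eigenvalue of a symmetric PSD operator."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(n_iter):
        w = matvec(v)
        lam = float(v @ w)         # Rayleigh quotient at the current iterate
        norm = np.linalg.norm(w)
        if norm == 0.0:            # operator is zero along v; stop early
            break
        v = w / norm
    return lam

def hutchinson_trace(matvec, dim, n_samples=4000, seed=1):
    """Estimate trace(F) as the mean of v^T F v over Rademacher probes v."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        v = rng.choice([-1.0, 1.0], size=dim)
        total += float(v @ matvec(v))
    return total / n_samples

# Hypothetical toy setup: K classes, d input features.
rng = np.random.default_rng(42)
K, d = 5, 8
W = rng.standard_normal((K, d))
b = rng.standard_normal(K)
x = rng.standard_normal(d)
p = softmax(W @ x + b)

matvec = lambda v: fim_matvec(W, p, v)
lam_max = spectral_norm_power_iteration(matvec, d)
trace_est = hutchinson_trace(matvec, d)
print("estimated spectral norm of FIM:", lam_max)
print("estimated trace of FIM:", trace_est)
```

For a deep network, the two matrix products in `fim_matvec` would be replaced by Jacobian-vector and vector-Jacobian products from an automatic differentiation library, so neither J nor F needs to be stored. Since F is positive semidefinite, the trace estimate also gives a cheap upper bound on the spectral norm.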