Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models

Abstract

Vision-Language Models (VLMs) excel at zero-shot inference but often degrade under test-time domain shifts. Episodic test-time adaptation strategies, which adapt a VLM to a single unlabeled image, have therefore recently emerged as powerful techniques. However, existing adaptation strategies, such as test-time prompt tuning, typically require backpropagating through large encoder weights or altering core model components. In this work, we introduce Spectrum-Aware Test-Time Steering (STS), a lightweight adaptation framework. STS extracts a spectral subspace from the textual embeddings to define principal semantic directions, and learns to steer latent representations in a spectrum-aware manner by adapting a small number of per-sample shift parameters to minimize entropy across augmented views. STS operates entirely at inference time in the latent space, without backpropagating through or modifying the frozen encoders. Following standard evaluation protocols, our comprehensive experiments demonstrate that STS largely surpasses or compares favorably against state-of-the-art test-time adaptation methods, while introducing only a handful of additional parameters and achieving up to 8x faster inference with a 12x smaller memory footprint than conventional test-time prompt tuning. The code is available at https://github.com/kdafnis/STS.
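
The steering idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that), and it rests on several assumptions the abstract does not state: the spectral subspace is taken from an SVD of the centered class text embeddings, the shift is added to the image-view features, the objective is the entropy of the view-averaged prediction, and the gradient is estimated by finite differences. All function names and hyperparameters (`spectral_basis`, `marginal_entropy`, `steer`, `k`, `lr`, `scale`) are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row (or a single vector) to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def spectral_basis(text_emb, k):
    """Top-k principal directions of the class text embeddings.

    Assumption: the 'spectral subspace' is the span of the leading right
    singular vectors of the centered text embedding matrix.
    """
    centered = text_emb - text_emb.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]  # shape (k, d), orthonormal rows


def marginal_entropy(gamma, views, text_emb, basis, scale):
    """Entropy of the prediction averaged over the augmented views."""
    steered = l2_normalize(views + gamma @ basis)  # shift inside the subspace
    logits = scale * (steered @ text_emb.T)        # (n_views, n_classes)
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    mean_p = probs.mean(axis=0)
    return float(-(mean_p * np.log(mean_p + 1e-12)).sum())


def steer(views, text_emb, k=4, steps=20, lr=0.05, scale=20.0, eps=1e-4):
    """Fit k per-sample shift coefficients by entropy minimization.

    Only `gamma` is updated; the embeddings are treated as fixed encoder
    outputs, so nothing is backpropagated through an encoder. The gradient
    is a central finite difference (an assumption made for simplicity).
    """
    basis = spectral_basis(text_emb, k)
    gamma = np.zeros(k)
    best_gamma = gamma.copy()
    best_h = marginal_entropy(gamma, views, text_emb, basis, scale)
    for _ in range(steps):
        grad = np.zeros(k)
        for i in range(k):
            delta = np.zeros(k)
            delta[i] = eps
            h_plus = marginal_entropy(gamma + delta, views, text_emb, basis, scale)
            h_minus = marginal_entropy(gamma - delta, views, text_emb, basis, scale)
            grad[i] = (h_plus - h_minus) / (2 * eps)
        gamma = gamma - lr * grad
        h = marginal_entropy(gamma, views, text_emb, basis, scale)
        if h < best_h:  # keep the lowest-entropy coefficients seen so far
            best_h, best_gamma = h, gamma.copy()
    return best_gamma, basis
```

Here `views` would hold the L2-normalized image embeddings of the augmented views of one test image, and `text_emb` the L2-normalized class text embeddings, both precomputed once by the frozen encoders. Because only the `k` coefficients in `gamma` are optimized, the per-sample state stays tiny, which is the property the abstract credits for the speed and memory savings.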

