Adapting Vision-Language Models Without Labels: A Comprehensive Survey

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities across a wide range of tasks. However, their performance often remains suboptimal when directly applied to specific downstream scenarios without task-specific adaptation. To enhance their utility while preserving data efficiency, recent research has increasingly focused on unsupervised adaptation methods that do not rely on labeled data. Despite growing interest in this area, a unified, task-oriented survey dedicated to unsupervised VLM adaptation is still lacking. To bridge this gap, we present a comprehensive and structured overview of the field. We propose a taxonomy based on the availability and nature of unlabeled visual data, categorizing existing approaches into four key paradigms: Data-Free Transfer (no data), Unsupervised Domain Transfer (abundant data), Episodic Test-Time Adaptation (batch data), and Online Test-Time Adaptation (streaming data). Within this framework, we analyze core methodologies and adaptation strategies associated with each paradigm, aiming to establish a systematic understanding of the field. Additionally, we review representative benchmarks across diverse applications and highlight open challenges and promising directions for future research. An actively maintained repository of relevant literature is available at https://github.com/tim-learn/Awesome-LabelFree-VLMs.
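The four-way taxonomy in the abstract is essentially a lookup from the kind of unlabeled visual data available to the adaptation paradigm that applies. The sketch below encodes only that mapping as stated in the abstract. It is illustrative: `DataAvailability`, `PARADIGM`, and `paradigm_for` are hypothetical names and do not come from the survey or its repository.

```python
from enum import Enum


class DataAvailability(Enum):
    """Availability/nature of unlabeled visual data (hypothetical enum, labels taken from the abstract)."""
    NONE = "no data"
    ABUNDANT = "abundant data"
    BATCH = "batch data"
    STREAMING = "streaming data"


# Hypothetical lookup table: data-availability setting -> paradigm named in the taxonomy.
PARADIGM = {
    DataAvailability.NONE: "Data-Free Transfer",
    DataAvailability.ABUNDANT: "Unsupervised Domain Transfer",
    DataAvailability.BATCH: "Episodic Test-Time Adaptation",
    DataAvailability.STREAMING: "Online Test-Time Adaptation",
}


def paradigm_for(availability: DataAvailability) -> str:
    """Return the adaptation paradigm for a given data-availability setting."""
    return PARADIGM[availability]


if __name__ == "__main__":
    # Print every setting alongside the paradigm it maps to.
    for setting in DataAvailability:
        print(f"{setting.value:>15} -> {paradigm_for(setting)}")
```

Keying the table on an enum makes the taxonomy exhaustive by construction: every data-availability setting maps to exactly one paradigm.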
