When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs

Abstract

Large Vision-Language Models (L-VLMs) have demonstrated remarkable performance on a range of vision and language tasks, including visual question answering (VQA). However, their high computational cost makes them impractical for resource-constrained settings and inference-heavy applications. Small Vision-Language Models (S-VLMs), in contrast, are efficient but suffer from a significant performance gap relative to their larger counterparts. In this work, we introduce the Model Parity Aligner (MPA), a novel framework designed to systematically improve S-VLMs by leveraging unlabeled images and effective knowledge transfer from L-VLMs. Unlike traditional knowledge distillation methods, which rely on labeled training data, MPA employs a strategic parity-based approach that precisely identifies the knowledge disparities between S-VLMs and L-VLMs and optimizes training by targeting only these disparities. We conduct extensive experiments on four diverse VQA benchmarks, namely TextVQA, ST-VQA, ChartQA, and OKVQA, each of which requires specialized reasoning capabilities such as text recognition, chart interpretation, and commonsense and factual understanding. Our results demonstrate that MPA consistently enhances the performance of S-VLMs on all four benchmarks, reducing the performance gap while maintaining computational efficiency. We make our code publicly available.

