Localized Symbolic Knowledge Distillation for Visual Commonsense Models

Abstract

Instruction-following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only for supporting reference-grounded VL benchmarks, but also for practical applications that require precise within-image reasoning. We build Localized Visual Commonsense models, which allow users to specify (multiple) regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description, both automatically generated by a set of VL models. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus successfully distills existing VL models to support a reference-as-input interface. Empirical results and human evaluations in a zero-shot setup demonstrate that our distillation method yields VL models that reason more precisely than a baseline that passes a generated referring expression to an LLM.
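The data-collection loop described in the abstract (verbalize the image and its regions, sample commonsense statements from an LLM, keep only what a critic rates highly) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_prompt`, `collect_corpus`, the prompt wording, the `[ID]` region notation, and the `threshold` value are all hypothetical, and the LLM and critic are passed in as plain callables rather than real models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class Example:
    image_id: str
    prompt: str
    statement: str
    score: float


def build_prompt(global_desc: str, region_descs: Dict[int, str]) -> str:
    """Verbalize one image: a global description plus one line per region ID.

    The wording and the [ID] notation are assumptions made for illustration.
    """
    lines = [f"Image: {global_desc}"]
    lines += [f"Region [{rid}]: {desc}" for rid, desc in sorted(region_descs.items())]
    lines.append("Give a commonsense inference that refers to regions by their [ID].")
    return "\n".join(lines)


def collect_corpus(
    images: Iterable[Tuple[str, str, Dict[int, str]]],
    sample_llm: Callable[[str], List[str]],
    critic: Callable[[str, str], float],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> List[Example]:
    """Sample statements from the LLM and keep those the critic scores highly."""
    kept: List[Example] = []
    for image_id, global_desc, region_descs in images:
        prompt = build_prompt(global_desc, region_descs)
        for statement in sample_llm(prompt):
            score = critic(prompt, statement)
            if score >= threshold:
                kept.append(Example(image_id, prompt, statement, score))
    return kept
```

In this sketch the filtered `Example` list stands in for the localized commonsense corpus; the descriptions in `images` would come from the VL captioning models, and the retained examples would then be used to fine-tune a VL model that accepts region references as input.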

