Through the Looking Glass: Common Sense Consistency Evaluation of Weird Images

Abstract

Measuring how realistic an image looks is a complex task in artificial intelligence research. For example, an image of a boy with a vacuum cleaner in a desert violates common sense. We introduce a novel method, which we call Through the Looking Glass (TLG), to assess image common sense consistency using Large Vision-Language Models (LVLMs) and a Transformer-based encoder. By leveraging LVLMs to extract atomic facts from these images, we obtain a mix of accurate facts. We then fine-tune a compact attention-pooling classifier over the encoded atomic facts. TLG achieves new state-of-the-art performance on the WHOOPS! and WEIRD datasets while relying on a compact fine-tuned component.
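The attention-pooling step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the pooling formulation, so the single learned query vector, the logistic output layer, and all names here (`attention_pool`, `weirdness_probability`, `query`, `clf_weights`, `clf_bias`) are assumptions made for illustration. The fact embeddings are toy vectors standing in for the output of the Transformer-based encoder, and only the forward pass is shown, not the fine-tuning.

```python
import math
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    fact_embeddings: Sequence[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool a variable number of fact embeddings into one vector.

    Each fact is scored against a (hypothetical) learned query vector,
    the scores are normalized with softmax, and the embeddings are
    averaged with those weights.
    """
    scores = [dot(h, query) for h in fact_embeddings]
    weights = softmax(scores)
    dim = len(fact_embeddings[0])
    pooled = [
        sum(w * h[i] for w, h in zip(weights, fact_embeddings))
        for i in range(dim)
    ]
    return pooled, weights


def weirdness_probability(
    fact_embeddings: Sequence[Sequence[float]],
    query: Sequence[float],
    clf_weights: Sequence[float],
    clf_bias: float,
) -> float:
    """Logistic classifier on top of the pooled representation (assumed head)."""
    pooled, _ = attention_pool(fact_embeddings, query)
    logit = dot(pooled, clf_weights) + clf_bias
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    # Toy 2-dimensional embeddings for three atomic facts (made-up values).
    facts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    pooled, weights = attention_pool(facts, query=[2.0, 0.0])
    print("attention weights:", weights)
    print("pooled vector:", pooled)
    print("score:", weirdness_probability(facts, [2.0, 0.0], [0.5, -0.5], 0.0))
```

In a trained model the query and classifier parameters would be learned from labeled normal and weird images. The point of the sketch is only that the pooled vector has a fixed size regardless of how many atomic facts were extracted, which keeps the trainable component small.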
