Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint
May 29, 2025
Authors: Heekyung Lee, Jiaxin Ge, Tsung-Han Wu, Minwoo Kang, Trevor Darrell, David M. Chan
cs.AI
Abstract
Rebus puzzles, visual riddles that encode language through imagery, spatial
arrangement, and symbolic substitution, pose a unique challenge to current
vision-language models (VLMs). Unlike traditional image captioning or question
answering tasks, rebus solving requires multi-modal abstraction, symbolic
reasoning, and a grasp of cultural, phonetic, and linguistic puns. In this
paper, we investigate the capacity of contemporary VLMs to interpret and solve
rebus puzzles by constructing a hand-generated and annotated benchmark of
diverse English-language rebus puzzles, ranging from simple pictographic
substitutions to spatially-dependent cues ("head" over "heels"). We analyze how
different VLMs perform, and our findings reveal that while VLMs exhibit some
surprising capabilities in decoding simple visual clues, they struggle
significantly with tasks requiring abstract reasoning, lateral thinking, and
understanding visual metaphors.