Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint
May 29, 2025
Authors: Heekyung Lee, Jiaxin Ge, Tsung-Han Wu, Minwoo Kang, Trevor Darrell, David M. Chan
cs.AI
Abstract
Rebus puzzles, visual riddles that encode language through imagery, spatial
arrangement, and symbolic substitution, pose a unique challenge to current
vision-language models (VLMs). Unlike traditional image captioning or question
answering tasks, rebus solving requires multi-modal abstraction, symbolic
reasoning, and a grasp of cultural, phonetic and linguistic puns. In this
paper, we investigate the capacity of contemporary VLMs to interpret and solve
rebus puzzles by constructing a hand-generated and annotated benchmark of
diverse English-language rebus puzzles, ranging from simple pictographic
substitutions to spatially-dependent cues ("head" over "heels"). We analyze how
different VLMs perform, and our findings reveal that while VLMs exhibit some
surprising capabilities in decoding simple visual clues, they struggle
significantly with tasks requiring abstract reasoning, lateral thinking, and
understanding visual metaphors.