Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint

Abstract

Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multi-modal abstraction, symbolic reasoning, and a grasp of cultural, phonetic, and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially dependent cues (e.g., "head" over "heels"). We analyze how different VLMs perform on this benchmark, and our findings reveal that while VLMs exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and an understanding of visual metaphors.